CSC/ECE 506 Fall 2007/wiki 2 5 2281

Latest revision as of 03:25, 29 September 2007

Objective

To create a table detailing current multi-core processor architectures and their internal cache specifications.

Cache Sizes in multicore architectures

Cache sizes have increased of late in multicore architectures. Designers are not only adding more levels of cache inside the core, but are also designing caches that are shared among processors, both inside and outside the core.

Several design considerations have to be borne in mind before the actual core-cache design. Latency is the time taken to bring the content of a "missed" address from the lower memory level into the cache. With multiple levels of cache, this time is reduced significantly, because a miss in one level can often be served by the next cache level instead of by main memory. Keeping more of those levels on-chip reduces the latency further, giving faster processor performance.
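The effect of extra cache levels on latency can be illustrated with the standard average memory access time (AMAT) calculation. The sketch below is only an illustration: the hit times, miss rates and memory latency are made-up values, not figures for any processor in the table, and the miss rates are assumed to be local (misses at a level divided by accesses that reach that level).

```python
def amat(levels, memory_latency):
    """Average memory access time, in cycles, for a cache hierarchy.

    levels: list of (hit_time_cycles, local_miss_rate), ordered from L1 outward.
    memory_latency: cycles to fetch from main memory after the last level misses.
    """
    total = 0.0
    reach = 1.0  # fraction of all accesses that reach the current level
    for hit_time, miss_rate in levels:
        total += reach * hit_time   # every access reaching this level pays its hit time
        reach *= miss_rate          # only the misses continue to the next level
    return total + reach * memory_latency


# Hypothetical numbers, chosen only to show the trend.
l1_only = amat([(3, 0.05)], memory_latency=200)
l1_and_l2 = amat([(3, 0.05), (12, 0.20)], memory_latency=200)
print(l1_only, l1_and_l2)  # roughly 13 cycles versus roughly 5.6 cycles
```

With these assumed values, adding a second level cuts the average access time by more than half, because most L1 misses no longer pay the full memory latency.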

The number of times a processor goes off-chip is also an important design factor. The more lower-level memory accesses there are, especially off-chip accesses, the worse the processor performs. Hence the associativity of the cache, the number of cache levels, and the cache size all play a vital role in determining overall processor performance.
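How cache size, line size and associativity fit together can be sketched with a little arithmetic. The helper names below are hypothetical, and the 64-byte line size is an assumption made for the example (most rows of the table leave line size unspecified); sizes are assumed to be powers of two.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (number_of_sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the number of sets
    return sets, offset_bits, index_bits


def split_address(addr, line_bytes, sets):
    """Split an address into (tag, set_index, byte_offset)."""
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset


# A 64 KByte, 2-way cache with an assumed 64-byte line.
sets, offset_bits, index_bits = cache_geometry(64 * 1024, 64, 2)
print(sets, offset_bits, index_bits)
print(split_address(0x12345, 64, sets))
```

For a fixed size, raising the associativity lowers the number of sets, so more lines compete for each set but each set can hold more of them at once; this is the trade-off behind the different associativities listed in the table.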

Below is a table of current multi-core processor architectures, with details such as the number of cache levels, cache size, associativity and so on.

{| border="1" cellspacing="0" cellpadding="5" align="center"
! Multi-core Architecture
! Number of levels
! Line Size
! Cache Size
! Associativity
! Latency
! Is the level shared?
! Coherence Protocol used
|-
| AMD Opteron Processor
| 2
| -
| 64 KByte L1 Cache (separate Data and Instruction Caches); 1024 KByte L2
| 2-Way Associative ECC-Protected L1 Data Cache & Parity-Protected Instruction Cache; <br>16-Way Associative Parity-Protected L2 Cache
| Two 64-bit operations per cycle, 3-cycle latency
| No
| Exclusive cache architecture
|-
| AMD Athlon X2 Dual Core
| 2
| -
| 64 KByte L1 Cache (separate Data and Instruction Caches); 1024 KByte L2
| 2-Way Associative ECC-Protected L1 Data Cache & Parity-Protected Instruction Cache; <br>16-Way Associative Parity-Protected L2 Cache
| Two 64-bit operations per cycle, 3-cycle latency
| No
| Exclusive cache architecture
|-
| AMD Turion 64 Mobile
| 2
| -
| 64 KByte L1; up to 1 MByte of L2, with 512 KByte options
| 2-Way Associative ECC-Protected L1 Data Cache & Parity-Protected L1 Instruction Cache; <br>16-Way Associative ECC-Protected L2 Cache
| Two 64-bit operations per cycle, 3-cycle latency, with advanced branch prediction
| No
| Exclusive cache architecture
|-
| AMD Sempron Processor
| 2
| -
| 64 KByte ECC-Protected L1 Data Cache & Parity-Protected Instruction Cache; <br>256 KByte ECC-Protected L2 Cache
| 2-Way Associative L1 Cache; 16-Way Associative L2 Cache
| Two 64-bit operations per cycle, 3-cycle latency
| No
| Exclusive cache architecture
|-
| AMD Athlon Duron Processor
| 2
| -
| Integrated 128 KByte L1 Cache and an exclusive 64 KByte L2 Cache
| -
| -
| No
| Exclusive cache architecture
|-
| AMD Palermo Processor
| 2
| -
| 64 KByte L1 Data Cache & L1 Instruction Cache; <br>Unified 128 or 256 KByte L2 Cache
| -
| -
| No
| Inclusive
|-
| AMD Thoroughbred (TBRED)
| 2
| -
| 64 KByte L1 Data Cache & L1 Instruction Cache; <br>Unified 256 KByte full-speed L2 Cache
| -
| -
| No
| Inclusive
|-
| AMD Barton Processor
| 2
| -
| 64 KByte L1 Data Cache & L1 Instruction Cache; <br>Unified 512 KByte L2 Cache
| -
| -
| No
| -
|-
| AMD Thunderbird
| 2
| -
| -
| 16-Way
| -
| -
| -
|-
| CELL Processor (PlayStation 3 processor), manufactured by Toshiba, IBM and Sony
| The PowerPC core at the centre of the Cell has 2 levels; <br>each of the surrounding SPEs has just one level of cache
| -
| 32 KByte Data Cache + 32 KByte Instruction Cache in the PowerPC core; <br>the surrounding SPEs have a 256 KByte Unified Cache
| -
| -
| The L2 Cache of the PowerPC core is shared by the surrounding SPEs
| -
|-
| AMD Athlon 64 X2 Dual Core - 4600+
| 2
| -
| 128 KByte L1 Unified Cache; 512 KByte L2 Unified Cache
| -
| -
| No
| -
|-
| Storm-1 Family by Stream Processors
| 1
| -
| 16 KByte L1 Data / Instruction Cache
| -
| 533 MHz data rate between L1 and DDR memory
| No
| NA
|-
| UltraSPARC IV
| 2
| 128 bytes to 512 bytes
| 64 KByte Dual L1 Data Cache; 32 KByte L2, extendable up to 16 MByte
| 2-Way Set Associative per core
| -
| No
| -
|-
| UltraSPARC IV+
| 2
| -
| 32 MByte L2 on-chip; 32 MByte L3 external
| -
| -
| No
| -
|-
| UltraSPARC T1
| 2
| 4 banks for L2
| 8 KByte Data Cache; 16 KByte Instruction Cache; <br>3 MByte L2 Cache
| 4-Way Set Associative L1; 12-Way Set Associative L2 Cache
| -
| No
| -
|-
| Intel Pentium D Series
| 3 (L1 + L2 + Execution Trace Cache holding the decoded micro-ops)
| -
| 2 * 16 KByte of L1 Data Cache; <br>2 * 2 MByte L2 Unified Cache
| -
| -
| Yes
| -
|-
| Intel Itanium 2
| 3
| -
| 16 KByte L1 Instruction Cache; 16 KByte L1 Data Cache; 256 KByte L2 Unified Cache; 3 - 9 MByte of L3 Unified Cache
| 4-Way Set Associative L1 Cache; 8-Way Set Associative L2 Cache
| 6.4 - 2.1 GB/s transfer rate between L3 and external memory
| L2 and L3 Caches are shared
| -
|-
| CRAY X1
| 2
| -
| 32 KByte Core L1 Unified Cache; 512 MByte of L2 Unified Cache
| -
| 76 GB/s (50 GB/s for loads and 26 GB/s for stores) between the L2 Cache and memory
| Yes
| Integrated Vector Cache is used for coherence and helps tolerate memory latency
|}

A single-core processor, also called a unicore processor, does not have the problems associated with a multi-core processor. A multi-core processor must handle load sharing between the cores, cache coherency, the transfer of data between cache and memory and, if each core has a dedicated cache, the transfer of data between the caches themselves. This cache-to-cache and cache-to-memory communication overhead plays a vital role in the design of multi-core processors with on-chip cache.

A unicore processor is strongly tied to its lower-level memory, in that it has sole access to the lower-level memories. In a multi-core processor, by contrast, the lower-level memories might be shared between the cores, which makes it harder to manage the data and instructions flowing over the bus between the cores and the memories.
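The coherence problem that arises when each core has a dedicated cache can be sketched with a toy write-invalidate model. This is a deliberately simplified illustration, not the protocol of any processor in the table: the class names `Core` and `Bus` are hypothetical, caches are unbounded dictionaries, and writes go straight through to memory.

```python
class Core:
    """A core with a private cache (address -> value; presence means a valid copy)."""
    def __init__(self, name):
        self.name = name
        self.cache = {}


class Bus:
    """Toy shared bus with write-invalidate, write-through behaviour."""
    def __init__(self, cores):
        self.cores = cores
        self.memory = {}
        self.invalidations = 0

    def read(self, core, addr):
        if addr not in core.cache:
            # Miss: fetch the current value from shared memory.
            core.cache[addr] = self.memory.get(addr, 0)
        return core.cache[addr]

    def write(self, core, addr, value):
        # Invalidate every other core's copy so no stale value can be read.
        for other in self.cores:
            if other is not core and addr in other.cache:
                del other.cache[addr]
                self.invalidations += 1
        core.cache[addr] = value
        self.memory[addr] = value  # write-through keeps memory up to date


c0, c1 = Core("core0"), Core("core1")
bus = Bus([c0, c1])
bus.write(c0, 0x40, 1)
first = bus.read(c1, 0x40)   # core1 now caches the line as well
bus.write(c1, 0x40, 2)       # invalidates core0's copy
second = bus.read(c0, 0x40)  # core0 misses and refetches the new value
print(first, second, bus.invalidations)
```

Every invalidation and refetch here is traffic that a unicore processor never generates, which is the overhead the paragraphs above refer to.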

References

http://www.interfacebus.com/Controllers.html
http://www-03.ibm.com/servers/eserver/opteron/pdf/IBM_dualcore_whitepaper.pdf
http://arstechnica.com/news.ars/post/20061102-8135.html
http://compreviews.about.com/od/cpus/a/dualcore.htm
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2343
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118,00.html
http://www.broadcom.com/
http://www.via.com.tw/en/products/processors/eden/
http://www.centtech.com/
http://www.ibm.com/us/en/
http://www.streamprocessors.com/
http://www.streamprocessors.com/streamprocessors/resources/
http://www.netlib.org/utk/papers/advanced-computers/