<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sykang</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sykang"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Sykang"/>
	<updated>2026-05-12T03:35:40Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4790</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4790"/>
		<updated>2007-09-29T02:01:42Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* &amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
Modern processors use several cache technologies to bridge the widening gap between CPU and memory speed. Because the cache structure itself affects cache performance, choosing an appropriate structure is an important and difficult problem. For example, a bigger cache generally performs better; however, due to cache pollution, the benefit shows diminishing returns as the cache grows, so an appropriate cache size must be chosen. For this reason, it is worth surveying the cache characteristics of modern processors.&lt;br /&gt;
&lt;br /&gt;
The cache structure is determined by a few parameters, such as cache size, replacement algorithm, associativity, and cache line size. With the introduction of multi-core processors, cache coherence has also become an issue, and the coherence protocol used, such as MESI or MOESI, affects performance. This page presents several cache parameters for modern multicore processors as well as for a couple of single-core processors.&lt;br /&gt;
&lt;br /&gt;
= Cache sizes in multicore architectures =&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only; a &amp;quot;150KB trace cache&amp;quot; replaces the instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (each for data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Single-core Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only; a &amp;quot;150KB trace cache&amp;quot; replaces the instruction cache)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Conclusion =&lt;br /&gt;
Most processors introduced today have 32KB or 64KB data and instruction L1 caches with 2-, 4-, or 8-way set associativity, although many other sizes and associativities occur. Cache lines are typically 32, 64, or 128 bytes. MESI and MOESI are the prevailing cache coherence protocols.&lt;br /&gt;
&lt;br /&gt;
From the table above, we find little difference between the cache specifications of multi-core and single-core processors.&lt;br /&gt;
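The MESI protocol named in the table can be pictured as a small state machine kept per cache line. The following Python sketch is purely illustrative (it is not from the article, and it models only the transitions visible to a single cache, with simplified assumptions noted in the comments):

```python
# Illustrative toy model of MESI cache-coherence states for one cache line,
# as seen by a single cache. States: Modified, Exclusive, Shared, Invalid.
# Events: local reads/writes, plus reads/writes observed from other cores
# on the shared bus. This is a sketch, not a full protocol implementation.
MESI = {
    ("I", "local_read"): "E",   # assumes no other cache holds the line; else "S"
    ("I", "local_write"): "M",
    ("E", "local_write"): "M",  # silent upgrade: no bus traffic needed
    ("E", "bus_read"): "S",
    ("S", "local_write"): "M",  # other sharers must be invalidated first
    ("S", "bus_write"): "I",
    ("M", "bus_read"): "S",     # supply the dirty data, then downgrade
    ("M", "bus_write"): "I",
}

def next_state(state, event):
    # Events not listed leave the state unchanged (e.g. a read hit in "S").
    return MESI.get((state, event), state)

line = "I"
for ev in ["local_read", "local_write", "bus_read"]:
    line = next_state(line, ev)
print(line)  # a line written locally and then read by another core ends Shared
```

Under these simplified assumptions, MOESI (used by the AMD parts in the table) differs mainly in adding an Owned state, so a Modified line can be shared with other caches without first being written back to memory.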
&lt;br /&gt;
= References =&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4789</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4789"/>
		<updated>2007-09-29T02:01:24Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
Modern processors use several cache technologies to bridge the widening gap between CPU and memory speed. Because the cache structure itself affects cache performance, choosing an appropriate structure is an important and difficult problem. For example, a bigger cache generally performs better; however, due to cache pollution, the benefit shows diminishing returns as the cache grows, so an appropriate cache size must be chosen. For this reason, it is worth surveying the cache characteristics of modern processors.&lt;br /&gt;
&lt;br /&gt;
The cache structure is determined by a few parameters, such as cache size, replacement algorithm, associativity, and cache line size. With the introduction of multi-core processors, cache coherence has also become an issue, and the coherence protocol used, such as MESI or MOESI, affects performance. This page presents several cache parameters for modern multicore processors as well as for a couple of single-core processors.&lt;br /&gt;
&lt;br /&gt;
=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only; a &amp;quot;150KB trace cache&amp;quot; replaces the instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (each for data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Single-core Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only; a &amp;quot;150KB trace cache&amp;quot; replaces the instruction cache)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Conclusion =&lt;br /&gt;
Most processors introduced today have 32KB or 64KB data and instruction L1 caches with 2-, 4-, or 8-way set associativity, although many other sizes and associativities occur. Cache lines are typically 32, 64, or 128 bytes. MESI and MOESI are the prevailing cache coherence protocols.&lt;br /&gt;
&lt;br /&gt;
From the table above, we find little difference between the cache specifications of multi-core and single-core processors.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4770</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4770"/>
		<updated>2007-09-29T01:10:26Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
Modern processors use several cache technologies to bridge the widening gap between CPU and memory speed. Because the cache structure itself affects cache performance, choosing an appropriate structure is an important and difficult problem. For example, a bigger cache generally performs better; however, due to cache pollution, the benefit shows diminishing returns as the cache grows, so an appropriate cache size must be chosen. Hence, it is worth surveying the cache characteristics of modern processors.&lt;br /&gt;
&lt;br /&gt;
The cache structure is determined by a few parameters, such as cache size, replacement algorithm, associativity, and cache line size. With the introduction of multi-core processors, cache coherence has also become an issue, and the coherence protocol used, such as MESI or MOESI, affects performance. This material presents several cache parameters for modern multicore processors as well as for a couple of single-core processors.&lt;br /&gt;
&lt;br /&gt;
=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only; a &amp;quot;150KB trace cache&amp;quot; replaces the instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (each for data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Single-core Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only; a &amp;quot;150KB trace cache&amp;quot; replaces the instruction cache)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Conclusion =&lt;br /&gt;
Most processors introduced today have 32KB or 64KB data and instruction L1 caches with 2-, 4-, or 8-way set associativity, although many other sizes and associativities occur. Cache lines are typically 32, 64, or 128 bytes. MESI and MOESI are the prevailing cache coherence protocols.&lt;br /&gt;
&lt;br /&gt;
From the table above, we find little difference between the cache specifications of multi-core and single-core processors.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4769</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4769"/>
		<updated>2007-09-29T01:07:07Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
Modern processors use several cache technologies to bridge the widening gap between CPU and memory speed. Because the cache structure itself affects cache performance, choosing an appropriate structure is an important and difficult problem. For example, a bigger cache generally performs better; however, due to cache pollution, the benefit shows diminishing returns as the cache grows, so an appropriate cache size must be chosen. Hence, it is worth surveying the cache characteristics of modern processors.&lt;br /&gt;
&lt;br /&gt;
The cache structure is determined by a few parameters, such as cache size, replacement algorithm, associativity, and cache line size. With the introduction of multi-core processors, cache coherence has also become an issue, and the coherence protocol used, such as MESI or MOESI, affects performance. This material presents several cache parameters for modern multicore processors as well as for a couple of single-core processors.&lt;br /&gt;
&lt;br /&gt;
=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only; a &amp;quot;150KB trace cache&amp;quot; replaces the instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (each for data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Conclusion =&lt;br /&gt;
Most processors introduced nowadays have a 32KB or 64KB L1 data/instruction cache with 2-, 4-, or 8-way set associativity. Cache lines are either 64 or 128 bytes. MESI and MOESI are the prevailing cache coherence protocols. &lt;br /&gt;
&lt;br /&gt;
From the above table, we find that there is not much difference between the cache specifications of multi-core and single-core processors.&lt;br /&gt;
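The table's parameters fix a cache's geometry completely. As an illustration (pure arithmetic, no vendor data beyond the table's figures), the sketch below derives the number of sets and the address-bit split from size, line size, and associativity:

```python
import math

def cache_geometry(size_bytes, line_bytes, ways):
    """Return (num_sets, offset_bits, index_bits) for a set-associative cache."""
    num_sets = size_bytes // (line_bytes * ways)          # sets = size / (line * ways)
    offset_bits = int(math.log2(line_bytes))              # bits selecting a byte in a line
    index_bits = int(math.log2(num_sets))                 # bits selecting a set
    return num_sets, offset_bits, index_bits

# Athlon 64 X2 L1 data cache from the table: 64KB, 64-byte lines, 2-way
print(cache_geometry(64 * 1024, 64, 2))        # (512, 6, 9)

# Core 2 Duo L2 from the table: 4MB, 64-byte lines, 8-way
print(cache_geometry(4 * 1024 * 1024, 64, 8))  # (8192, 6, 13)
```

The remaining upper address bits form the tag, which is why, for a fixed total size, higher associativity trades index bits for comparator width rather than changing capacity.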
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4768</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4768"/>
		<updated>2007-09-29T01:05:35Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* &amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Introduction =&lt;br /&gt;
Modern processors employ several caching techniques to bridge the widening memory-CPU speed gap. Since the cache structure itself affects cache performance, choosing an appropriate structure is an important and difficult problem. For example, a bigger cache generally performs better; due to cache pollution, however, performance shows diminishing returns as the cache grows. Thus an appropriate cache size must be chosen. Hence, it is valuable to examine the cache characteristics of modern processors.&lt;br /&gt;
&lt;br /&gt;
The cache structure is determined by a few parameters, such as cache size, cache type (direct-mapped, set-associative, and so on), and cache line size. With the introduction of multi-core processors, cache coherence has also become an issue, and the coherence protocol used, such as MESI or MOESI, affects performance. In this material, several cache parameters are presented for modern multicore processors as well as for a couple of single-core processors.&lt;br /&gt;
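To make the MESI protocol mentioned above concrete, the following is a simplified, illustrative sketch of its per-line state transitions (bus transactions and data movement are omitted, so this is an assumption-laden toy, not a full protocol model):

```python
# Simplified MESI transitions for a single cache line.
# (state, event) -> next state; unlisted pairs leave the state unchanged.
MESI = {
    ("I", "local_read"):   "S",  # miss; fetch line, conservatively assume sharers
    ("I", "local_write"):  "M",  # miss; fetch with intent to modify
    ("S", "local_write"):  "M",  # upgrade after invalidating other sharers
    ("S", "remote_write"): "I",  # another core writes: our copy is stale
    ("E", "local_write"):  "M",  # silent upgrade: no bus traffic needed
    ("E", "remote_read"):  "S",  # another core reads: share the line
    ("M", "remote_read"):  "S",  # supply/flush dirty data, then share
    ("M", "remote_write"): "I",  # flush dirty data, then invalidate
}

def step(state, event):
    return MESI.get((state, event), state)

state = "I"
for ev in ["local_read", "local_write", "remote_read"]:
    state = step(state, ev)
print(state)  # "S": the line went I -> S -> M -> S
```

MOESI, used by the AMD parts in the table, adds an Owned state so a dirty line can be shared without first being written back to memory.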
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (each for data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Conclusion =&lt;br /&gt;
Most processors introduced nowadays have a 32KB or 64KB L1 data/instruction cache with 2-, 4-, or 8-way set associativity. Cache lines are either 64 or 128 bytes. MESI and MOESI are the prevailing cache coherence protocols. &lt;br /&gt;
&lt;br /&gt;
From the above table, we find that there is not much difference between the cache specifications of multi-core and single-core processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4765</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4765"/>
		<updated>2007-09-29T00:41:35Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (each for data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Most processors introduced nowadays have a 32KB or 64KB L1 data/instruction cache with 2-, 4-, or 8-way set associativity. Cache lines are either 64 or 128 bytes. MESI and MOESI are the prevailing cache coherence protocols. &lt;br /&gt;
&lt;br /&gt;
From the above table, we find that there is not much difference between the cache specifications of multi-core and single-core processors.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4500</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4500"/>
		<updated>2007-09-25T02:21:35Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (each for data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Most processors introduced nowadays have a 32KB or 64KB L1 data/instruction cache with 2-, 4-, or 8-way set associativity. Cache lines are almost always 64 or 128 bytes. MESI and MOESI are the prevailing cache coherence protocols. &lt;br /&gt;
&lt;br /&gt;
From the above table, we find that there is not much difference between the cache specifications of multi-core and single-core processors.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4499</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4499"/>
		<updated>2007-09-25T02:21:00Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (a piece for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (a piece for both data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Most processors introduced nowadays have 32KB or 64KB data/instruction L1 caches with 2-, 4-, or 8-way set associativity. Cache lines are mostly 64 or 128 bytes. MESI and MOESI are the prevailing cache coherence protocols.&lt;br /&gt;
&lt;br /&gt;
From the above table we find that there isn't much difference in the specifications of caches used in multi-core and single-core processors.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4492</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4492"/>
		<updated>2007-09-25T02:17:57Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (a piece for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (a piece for both data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Most processors introduced nowadays use 4- or 8-way set-associative caches for both data and instructions.&lt;br /&gt;
Cache lines are typically 64 bytes. MESI and MOESI are the prevailing cache coherence protocols.&lt;br /&gt;
&lt;br /&gt;
From the above table we find that there isn't much difference in the specifications of caches used in multi-core and single-core processors.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4486</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4486"/>
		<updated>2007-09-25T02:09:59Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (a piece for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (a piece for both data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
Most processors use set-associative caches for both data and instructions, and employ either the MESI or the MOESI cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
From the above table we find that there isn't much difference in the specifications of caches used in multi-core and single-core processors.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4484</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4484"/>
		<updated>2007-09-25T02:07:13Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* &amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (a piece for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (a piece for both data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
From the above table we find that there isn't much difference in the specifications of caches used in multi-core and single-core processors.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4483</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4483"/>
		<updated>2007-09-25T02:06:32Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* &amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (each for data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
The table above shows little difference between the cache specifications of multi-core and single-core processors.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4482</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4482"/>
		<updated>2007-09-25T02:03:49Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (each for data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data (2-way associative), 32KB instruction (2-Way associative)&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
The table above shows little difference between the cache specifications of multi-core and single-core processors.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4481</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4481"/>
		<updated>2007-09-25T02:02:48Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* &amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (each for data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32KB data (2-way associative), 32KB instruction (2-Way associative)&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
The table above shows little difference between the cache specifications of multi-core and single-core processors.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4479</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4479"/>
		<updated>2007-09-25T02:02:17Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* &amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (each for data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 16 KB data, 16KB Instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 -&lt;br /&gt;
| L1 - 32KB data (2-way associative), 32KB instruction (2-Way associative)&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
The table above shows little difference between the cache specifications of multi-core and single-core processors.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4477</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4477"/>
		<updated>2007-09-25T01:59:00Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* &amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owned Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only. Instead of instruction cache, a &amp;quot;150KB trace cache&amp;quot; is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (a piece for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (a piece for both data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only; instead of an instruction cache, a 12K µops trace cache is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - &amp;lt;br/&amp;gt;L2 - &lt;br /&gt;
| L1 - 16 KB data, 16 KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 4 way &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 -&lt;br /&gt;
| L1 - 32KB data (2-way associative), 32KB instruction (2-Way associative)&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
From the table above, we find little difference between the cache specifications of multi-core and single-core processors of the same generation; multicore designs largely replicate per-core L1/L2 hierarchies and keep them consistent with a MESI or MOESI coherence protocol.&lt;br /&gt;
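The size, line size, and associativity columns in the table fully determine a cache's geometry: the number of sets is the total size divided by (line size x associativity). A minimal sketch, using the Core 2 Duo L2 figures from the table (4MB, 64 byte lines, 8-way):&lt;br /&gt;

```python
def cache_sets(size_bytes, line_bytes, ways):
    """Number of sets in a set-associative cache:
    sets = total size / (line size * associativity)."""
    assert size_bytes % (line_bytes * ways) == 0
    return size_bytes // (line_bytes * ways)

# Core 2 Duo L2 entry from the table: 4MB, 64 byte lines, 8-way
print(cache_sets(4 * 1024 * 1024, 64, 8))  # 8192 sets
```

The same formula applies to any row; for example the SB1250's 512KB, 32 byte line, 4-way L2 gives 4096 sets.&lt;br /&gt;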
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_05_sa&amp;diff=4472</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 05 sa</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_05_sa&amp;diff=4472"/>
		<updated>2007-09-25T01:49:29Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* &amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only; instead of an instruction cache, a 12K µops trace cache is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (a piece for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (a piece for both data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 -&lt;br /&gt;
| L1 - 32KB data (2-way associative), 32KB instruction (2-Way associative)&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only; instead of an instruction cache, a 12K µops trace cache is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - &amp;lt;br/&amp;gt;L2 - &lt;br /&gt;
| L1 - 16 KB data, 16 KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
| Intel Core Solo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - &amp;lt;br/&amp;gt;L2 - &lt;br /&gt;
| L1 - 32 KB data, 32KB Instruction&amp;lt;br/&amp;gt;L2 - 2MB&lt;br /&gt;
| L1 - &amp;lt;br/&amp;gt;L2 - &lt;br /&gt;
| &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_05_sa&amp;diff=4470</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 05 sa</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_05_sa&amp;diff=4470"/>
		<updated>2007-09-25T01:47:35Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* &amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only; instead of an instruction cache, a 12K µops trace cache is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (a piece for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (a piece for both data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only; instead of an instruction cache, a 12K µops trace cache is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - &amp;lt;br/&amp;gt;L2 - &lt;br /&gt;
| L1 - 16 KB data, 16 KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 -&lt;br /&gt;
| L1 - 32KB data (2-way associative), 32KB instruction (2-Way associative)&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_05_sa&amp;diff=4467</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 05 sa</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_05_sa&amp;diff=4467"/>
		<updated>2007-09-25T01:45:45Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* &amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=&amp;lt;center&amp;gt;Cache sizes in multicore architectures&amp;lt;/center&amp;gt;=&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures,&lt;br /&gt;
including such parameters as number of levels, line size, size and&lt;br /&gt;
associativity of each level, latency of each level, whether each level&lt;br /&gt;
is shared, and coherence protocol used. Compare this with two or three&lt;br /&gt;
recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+'''Detail of Caches'''&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Multicore Processors&lt;br /&gt;
|-&lt;br /&gt;
! Processor Name&lt;br /&gt;
! Number of Levels&lt;br /&gt;
! Line Size&lt;br /&gt;
! Cache Size&lt;br /&gt;
! Associativity&lt;br /&gt;
! Coherence Protocol&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 X2&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 512KB to 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64 FX&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| AMD Opteron&amp;lt;br/&amp;gt;(marketed for servers)&lt;br /&gt;
| 2&lt;br /&gt;
| 64 bytes (for both L1 &amp;amp; L2)&lt;br /&gt;
| L1 - 64KB (Data) + 64KB (Instruction) per core&amp;lt;br/&amp;gt;L2 - 1MB per core&lt;br /&gt;
| L1 - 2 way (Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 16 way associative&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium D&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 16 KB (data only; instead of an instruction cache, a 12K µops trace cache is used)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core	&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (both Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 1MB or 2MB per core&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 32 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 2MB or 4MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Broadcom SiByte SB1250&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 - 32 byte lines&lt;br /&gt;
| L1 - 32 KB (a piece for Data and Instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems UltraSPARC IV&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 128 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 64KB data, 32KB instruction&amp;lt;br/&amp;gt;L2 - up to 16MB&lt;br /&gt;
| L2 - 2 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| IBM Cell Processor&lt;br /&gt;
| 2&lt;br /&gt;
| Not Available&lt;br /&gt;
| L1 - 32 KB (a piece for both data and instruction caches)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way instruction, 4 way data&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
! colspan=&amp;quot;6&amp;quot; style=&amp;quot;background:#ffdead;&amp;quot; | Singlecore Processors&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 64 byte lines&lt;br /&gt;
| L1 - 64 KB (each for Data and Instruction cache)&amp;lt;br/&amp;gt;L2 - 512KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 16 way&lt;br /&gt;
| Modified Owner Exclusive Shared Invalid (MOESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium 4&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 64 byte lines&amp;lt;br/&amp;gt;L2 - 128 byte lines&lt;br /&gt;
| L1 - 8 KB (data only; instead of an instruction cache, a 12K µops trace cache is used)&amp;lt;br/&amp;gt;L2 - 256KB, 512KB or 1MB&lt;br /&gt;
| L1 - 4 way&amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium III 600&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - &amp;lt;br/&amp;gt;L2 - &lt;br /&gt;
| L1 - 16 KB data, 16 KB instruction&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - &amp;lt;br/&amp;gt;L2 - 8 way&lt;br /&gt;
| &lt;br /&gt;
|-&lt;br /&gt;
| AMD K6 / K6 III&lt;br /&gt;
| 2&lt;br /&gt;
| L1 - 32 byte lines&amp;lt;br/&amp;gt;L2 -&lt;br /&gt;
| L1 - 32KB data (2-way associative), 32KB instruction (2-Way associative)&amp;lt;br/&amp;gt;L2 - 256KB&lt;br /&gt;
| L1 - 2 way&amp;lt;br/&amp;gt;L2 - 4 way&lt;br /&gt;
| Modified Exclusive Shared Invalid (MESI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;br /&gt;
&lt;br /&gt;
[12] http://en.wikipedia.org/wiki/Cell_microprocessor&lt;br /&gt;
&lt;br /&gt;
[13] http://techreport.com/articles.x/8236/2&lt;br /&gt;
&lt;br /&gt;
[14] http://www.hardwaresecrets.com/article/481/9&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4425</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4425"/>
		<updated>2007-09-24T23:16:16Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Cache size and characteristics of multi-core processors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cache size and characteristics of multi-core processors ==&lt;br /&gt;
&lt;br /&gt;
The following tables show the cache characteristics of current multicore processors.&lt;br /&gt;
&lt;br /&gt;
''Topic'' - Create a table of caches used in current multicore architectures, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and coherence protocol used. Compare this with two or three recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== AMD ===&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; &lt;br /&gt;
!Processor&lt;br /&gt;
!L1 Cache&lt;br /&gt;
!L2 Cache&lt;br /&gt;
!L3 Cache&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
!Opteron Server/Workstation&lt;br /&gt;
!64K Data + 64K Instruction per core&lt;br /&gt;
2-way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2-way associative instruction cache&lt;br /&gt;
!1 MB per core, 16-way associative&lt;br /&gt;
! &lt;br /&gt;
!Dual cores, introduced on 4/22/2005&lt;br /&gt;
|-&lt;br /&gt;
!Athlon64 X2 family&lt;br /&gt;
!64K Data + 64K Instruction per core&lt;br /&gt;
2-way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2-way associative instruction cache&lt;br /&gt;
!up to 1 MB per core, 16-way associative &lt;br /&gt;
! &lt;br /&gt;
!Dual cores, introduced on 5/13/2005&lt;br /&gt;
|-&lt;br /&gt;
!Athlon64 FX&lt;br /&gt;
!64K Data + 64K Instruction per core&lt;br /&gt;
2-way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2-way associative instruction cache&lt;br /&gt;
!up to 1 MB per core, 16-way associative&lt;br /&gt;
! &lt;br /&gt;
!High performance desktop&lt;br /&gt;
|-&lt;br /&gt;
!Turion64 X2&lt;br /&gt;
!Total 256KB (128K per core)&lt;br /&gt;
!Total 1MB (512K per core)&lt;br /&gt;
! &lt;br /&gt;
!For Laptop&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Broadcom ===&lt;br /&gt;
* SiByte	SB1250&lt;br /&gt;
Two scalable MIPS cores, MESI coherence&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1-to-L1 latency: 28-36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4-way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
* SB1255&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1-to-L1 latency: 28-36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4-way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
* SB1455&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
&lt;br /&gt;
L2: 1MB shared, 8-way associative, ECC protected&lt;br /&gt;
&lt;br /&gt;
=== Cradle Technology ===&lt;br /&gt;
* CT3400 / CT3600 (multi-core DSPs)&lt;br /&gt;
CT3400: eight 32-bit DSPs, six RISC-like CPUs&lt;br /&gt;
32K instruction cache, 64K local data memory&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== IBM ===&lt;br /&gt;
* Cell&lt;br /&gt;
In the PlayStation 3&lt;br /&gt;
PowerPC based&lt;br /&gt;
8 cores optimized for vector operation&lt;br /&gt;
&lt;br /&gt;
* Power4 (the first dual-core processor, 2000)&lt;br /&gt;
L1: 64K per CPU instruction cache (128-byte line, direct-mapped, LRU)&lt;br /&gt;
     32K per CPU data cache (2-way, 128-byte line, LRU)&lt;br /&gt;
L2: 1440K, 8way, 128 byte line&lt;br /&gt;
L3: 128M, 8way 512 byte line&lt;br /&gt;
&lt;br /&gt;
* Power4+&lt;br /&gt;
L1: data 32K(2way set), instruction 64K(directly mapped)&lt;br /&gt;
L2: 3 x 0.5M shared by dual core, 40B/cycle per port&lt;br /&gt;
L3: 32M&lt;br /&gt;
&lt;br /&gt;
* Power5&lt;br /&gt;
Dual core&lt;br /&gt;
L1: I-64K(2way LRU), D-32K(4way, LRU)&lt;br /&gt;
L2: three 0.625M, 10way, LRU, 128 byte line&lt;br /&gt;
L3: 36M, 12way&lt;br /&gt;
&lt;br /&gt;
* PowerPC 970MP (dual core)&lt;br /&gt;
In the Apple PowerMac&lt;br /&gt;
L1: 32K data(2 way associative with parity protection), 64K instruction (directly mapped)&lt;br /&gt;
&lt;br /&gt;
L2: 1M per core, ECC&lt;br /&gt;
Cache-coherency snooping protocol&lt;br /&gt;
&lt;br /&gt;
=== Intel ===&lt;br /&gt;
* Core2 Quad&lt;br /&gt;
&lt;br /&gt;
* Xeon Quad core&lt;br /&gt;
Introduced on 12/13/2006&lt;br /&gt;
&lt;br /&gt;
* Xeon 5000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µops (trace cache per core)&lt;br /&gt;
L2: 2MB per core&lt;br /&gt;
&lt;br /&gt;
* Xeon MP7000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µops (trace cache per core)&lt;br /&gt;
L2: 1MB per core or 2x2MB per core&lt;br /&gt;
&lt;br /&gt;
* Xeon MP7300&lt;br /&gt;
L1: 32K data per core, 32K instruction per core&lt;br /&gt;
L2: up to 8MB, snoop filter, 4M shared L2 per die&lt;br /&gt;
&lt;br /&gt;
* Core Duo&lt;br /&gt;
* Core2 Duo&lt;br /&gt;
* Xeon (x1xx): dual core&lt;br /&gt;
&lt;br /&gt;
* Xeon 5100&lt;br /&gt;
L1 : 32KB (Data) + 32KB (Instruction) per core&lt;br /&gt;
L2 : 4MB (Shared) &lt;br /&gt;
&lt;br /&gt;
* Xeon 7100&lt;br /&gt;
L1: 16K data&lt;br /&gt;
L2: 2M 8way ECC&lt;br /&gt;
L3: up to 16M, 16-way, ECC&lt;br /&gt;
&lt;br /&gt;
* Itanium 2&lt;br /&gt;
Dual-core (Montecito)&lt;br /&gt;
L1: 32KB&lt;br /&gt;
L2: 256KB&lt;br /&gt;
L3: up to 9MB&lt;br /&gt;
&lt;br /&gt;
=== Stream Processors ===&lt;br /&gt;
* Storm-1 family&lt;br /&gt;
40 to 80 ALUs&lt;br /&gt;
&lt;br /&gt;
SP16HP-G220, SP16-G160, SP8-G80&lt;br /&gt;
L1: 32K data, 16K instruction&lt;br /&gt;
96K VLIW instruction memory&lt;br /&gt;
&lt;br /&gt;
=== Sun Microsystems ===&lt;br /&gt;
* UltraSPARC IV&lt;br /&gt;
dual&lt;br /&gt;
L1: 64K data, 32K instruction&lt;br /&gt;
L2: up to 16MB, external 8M 2way set associative per core&lt;br /&gt;
Cache line size changed from 512 to 128 bytes to reduce data contention associated with sub-blocked caches; LRU replacement policy, ECC&lt;br /&gt;
write cache: hash-indexed 2K,&lt;br /&gt;
2K prefetch cache&lt;br /&gt;
&lt;br /&gt;
* UltraSPARC IV+&lt;br /&gt;
L2: 2M on-chip &lt;br /&gt;
L3: 32M external&lt;br /&gt;
&lt;br /&gt;
* UltraSPARC T1&lt;br /&gt;
8 cores&lt;br /&gt;
L1: 16K instruction cache per core, 4way-set associative parity protected&lt;br /&gt;
8K data cache , parity protected, 4way set associative&lt;br /&gt;
&lt;br /&gt;
L2: 3M 12way, 4banks, ECC&lt;br /&gt;
&lt;br /&gt;
=== ETC ===&lt;br /&gt;
ARM&lt;br /&gt;
MPCore container for ARM9 &amp;amp; ARM11: high-performance embedded and entertainment&lt;br /&gt;
&lt;br /&gt;
PARISC&lt;br /&gt;
PA8800&lt;br /&gt;
L1: 1.5M data (4-way associative) + 1.5M instruction (4-way associative), 2-cycle latency&lt;br /&gt;
L2: 32M&lt;br /&gt;
&lt;br /&gt;
CT3600&lt;br /&gt;
2 quad DSPs, 8 DSPs per quad, 32KB instruction cache per quad, 125K data memory per quad&lt;br /&gt;
&lt;br /&gt;
Cavium Networks&lt;br /&gt;
Octeon&lt;br /&gt;
CN38XX&lt;br /&gt;
16 MIPS cores&lt;br /&gt;
L1: 8K Data + 32K Instruction&lt;br /&gt;
L2: 1M shared, ECC protected &lt;br /&gt;
Other: 2K write buffer per MIPS core&lt;br /&gt;
&lt;br /&gt;
CN36XX&lt;br /&gt;
4 MIPS cores&lt;br /&gt;
L1: 8K Data + 32K Instruction&lt;br /&gt;
L2: 512K &lt;br /&gt;
Other: 2K write buffer per MIPS core&lt;br /&gt;
&lt;br /&gt;
== Cache size and characteristics of single-core processors ==&lt;br /&gt;
&lt;br /&gt;
=== AMD ===&lt;br /&gt;
&lt;br /&gt;
* Athlon 64&lt;br /&gt;
Single core&lt;br /&gt;
L1: 64K data + 64K instruction, 2-way associative&lt;br /&gt;
L2: up to 1MB, 16-way associative&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4420</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4420"/>
		<updated>2007-09-24T23:03:44Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Cache size and characteristics of multi-core processors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cache size and characteristics of multi-core processors ==&lt;br /&gt;
&lt;br /&gt;
The following tables show the cache characteristics of current multicore processors, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and the coherence protocol used. They are compared with two or three recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== AMD ===&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; &lt;br /&gt;
!Processor&lt;br /&gt;
!L1 Cache&lt;br /&gt;
!L2 Cache&lt;br /&gt;
!L3 Cache&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
!Opteron Server/Workstation&lt;br /&gt;
!64K Data + 64K Instruction per core&lt;br /&gt;
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency&lt;br /&gt;
2-way associative instruction cache&lt;br /&gt;
!1 MB per core, 16-way associative&lt;br /&gt;
! &lt;br /&gt;
!Dual cores, introduced on 4/22/2005&lt;br /&gt;
|-&lt;br /&gt;
!Athlon64 X2 family&lt;br /&gt;
!64K Data + 64K Instruction per core&lt;br /&gt;
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency&lt;br /&gt;
2-way associative instruction cache&lt;br /&gt;
!up to 1 MB per core, 16-way associative &lt;br /&gt;
! &lt;br /&gt;
!Dual cores, introduced on 5/13/2005&lt;br /&gt;
|-&lt;br /&gt;
!Athlon64 FX&lt;br /&gt;
!64K Data + 64K Instruction per core&lt;br /&gt;
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency&lt;br /&gt;
2-way associative instruction cache&lt;br /&gt;
!up to 1 MB per core, 16-way associative&lt;br /&gt;
! &lt;br /&gt;
!High performance desktop&lt;br /&gt;
|-&lt;br /&gt;
!Turion64 X2&lt;br /&gt;
!Total 256KB (128K per core)&lt;br /&gt;
!Total 1MB (512K per core)&lt;br /&gt;
! &lt;br /&gt;
!For laptops&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== ARM ===&lt;br /&gt;
MPCore container for ARM9 &amp;amp; ARM11: high-performance embedded and entertainment&lt;br /&gt;
&lt;br /&gt;
=== Broadcom ===&lt;br /&gt;
* SiByte	SB1250&lt;br /&gt;
Two scalable MIPS cores, MESI coherence&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
* SB1255&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
* SB1455&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
&lt;br /&gt;
L2: 1MB shared, 8way associative, ECC protected&lt;br /&gt;
&lt;br /&gt;
=== Cradle Technology ===&lt;br /&gt;
Multi-core DSPs: CT3400, CT3600&lt;br /&gt;
* CT3400&lt;br /&gt;
8 32-bit DSPs, 6 RISC-like CPUs&lt;br /&gt;
32K instruction cache, 64K local data memory&lt;br /&gt;
&lt;br /&gt;
=== CT3600 ===&lt;br /&gt;
2 quad DSPs, 8 DSPs per quad, 32KB instruction cache per quad, 125K data memory per quad&lt;br /&gt;
&lt;br /&gt;
=== Cavium Networks ===&lt;br /&gt;
* Octeon&lt;br /&gt;
CN38XX/CN36XX: 4 to 16 MIPS64 cores&lt;br /&gt;
1M ECC protected shared L2 (CN38XX), 512K (CN36XX)&lt;br /&gt;
32K instruction cache / 8K data cache / 2K write buffer per MIPS core&lt;br /&gt;
&lt;br /&gt;
=== IBM ===&lt;br /&gt;
* Cell&lt;br /&gt;
In the PlayStation 3&lt;br /&gt;
PowerPC based&lt;br /&gt;
8 cores optimized for vector operation&lt;br /&gt;
Power4 (2000) was the first dual-core design&lt;br /&gt;
&lt;br /&gt;
* Power4&lt;br /&gt;
L1: 64K per CPU instruction cache (128-byte line, direct-mapped, LRU)&lt;br /&gt;
     32K per CPU data cache (2-way, 128-byte line, LRU)&lt;br /&gt;
L2: 1440K, 8way, 128 byte line&lt;br /&gt;
L3: 128M, 8way 512 byte line&lt;br /&gt;
&lt;br /&gt;
* Power4+&lt;br /&gt;
L1: data 32K(2way set), instruction 64K(directly mapped)&lt;br /&gt;
L2: 3 x 0.5M shared by dual core, 40B/cycle per port&lt;br /&gt;
L3: 32M&lt;br /&gt;
&lt;br /&gt;
* Power5&lt;br /&gt;
Dual core&lt;br /&gt;
L1: I-64K(2way LRU), D-32K(4way, LRU)&lt;br /&gt;
L2: three 0.625M, 10way, LRU, 128 byte line&lt;br /&gt;
L3: 36M, 12way&lt;br /&gt;
&lt;br /&gt;
* PowerPC 970MP (dual core)&lt;br /&gt;
In the Apple PowerMac&lt;br /&gt;
L1: 32K data(2 way associative with parity protection), 64K instruction (directly mapped)&lt;br /&gt;
&lt;br /&gt;
L2: 1M per core, ECC&lt;br /&gt;
Cache-coherency snooping protocol&lt;br /&gt;
&lt;br /&gt;
=== Intel ===&lt;br /&gt;
* Core2 Quad&lt;br /&gt;
&lt;br /&gt;
* Xeon Quad core&lt;br /&gt;
Introduced on 12/13/2006&lt;br /&gt;
&lt;br /&gt;
* Xeon 5000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µops (trace cache per core)&lt;br /&gt;
L2: 2MB per core&lt;br /&gt;
&lt;br /&gt;
* Xeon MP7000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µops (trace cache per core)&lt;br /&gt;
L2: 1MB per core or 2x2MB per core&lt;br /&gt;
&lt;br /&gt;
* Xeon MP7300&lt;br /&gt;
L1: 32K data per core, 32K instruction per core&lt;br /&gt;
L2: up to 8MB, snoop filter, 4M shared L2 per die&lt;br /&gt;
&lt;br /&gt;
* Core Duo&lt;br /&gt;
* Core2 Duo&lt;br /&gt;
* Xeon (x1xx): dual core&lt;br /&gt;
&lt;br /&gt;
* Xeon 5100&lt;br /&gt;
L1 : 32KB (Data) + 32KB (Instruction) per core&lt;br /&gt;
L2 : 4MB (Shared) &lt;br /&gt;
&lt;br /&gt;
* Xeon 7100&lt;br /&gt;
L1: 16K data&lt;br /&gt;
L2: 2M 8way ECC&lt;br /&gt;
L3: up to 16M, 16-way, ECC&lt;br /&gt;
&lt;br /&gt;
* Itanium 2&lt;br /&gt;
Dual-core (Montecito)&lt;br /&gt;
L1: 32KB&lt;br /&gt;
L2: 256KB&lt;br /&gt;
L3: up to 9MB&lt;br /&gt;
&lt;br /&gt;
=== PARISC ===&lt;br /&gt;
* PA8800&lt;br /&gt;
L1: 1.5M data (4-way set associative), 1.5M instruction (4-way set associative), 2-cycle latency&lt;br /&gt;
L2: 32M &lt;br /&gt;
&lt;br /&gt;
=== Stream Processors ===&lt;br /&gt;
* Storm-1 family&lt;br /&gt;
40 to 80 ALUs&lt;br /&gt;
&lt;br /&gt;
SP16HP-G220, SP16-G160, SP8-G80&lt;br /&gt;
L1: 32K data, 16K instruction&lt;br /&gt;
96K VLIW instruction memory&lt;br /&gt;
&lt;br /&gt;
=== Sun Microsystems ===&lt;br /&gt;
* UltraSPARC IV&lt;br /&gt;
dual&lt;br /&gt;
L1: 64K data, 32K instruction&lt;br /&gt;
L2: up to 16MB, external 8M 2way set associative per core&lt;br /&gt;
Cache line size changed from 512 to 128 bytes to reduce data contention associated with sub-blocked caches; LRU replacement policy, ECC&lt;br /&gt;
write cache: hash-indexed 2K,&lt;br /&gt;
2K prefetch cache&lt;br /&gt;
&lt;br /&gt;
* UltraSPARC IV+&lt;br /&gt;
L2: 2M on-chip &lt;br /&gt;
L3: 32M external&lt;br /&gt;
&lt;br /&gt;
* UltraSPARC T1&lt;br /&gt;
8 cores&lt;br /&gt;
L1: 16K instruction cache per core, 4way-set associative parity protected&lt;br /&gt;
8K data cache , parity protected, 4way set associative&lt;br /&gt;
&lt;br /&gt;
L2: 3M 12way, 4banks, ECC&lt;br /&gt;
&lt;br /&gt;
== Cache size and characteristics of single-core processors ==&lt;br /&gt;
&lt;br /&gt;
=== AMD ===&lt;br /&gt;
&lt;br /&gt;
* Athlon 64&lt;br /&gt;
Single core&lt;br /&gt;
L1: 64K data + 64K instruction, 2-way associative&lt;br /&gt;
L2: up to 1MB, 16-way associative&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4419</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4419"/>
		<updated>2007-09-24T23:03:18Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Cache size and characteristics of multi-core processors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cache size and characteristics of multi-core processors ==&lt;br /&gt;
&lt;br /&gt;
The following tables show the cache characteristics of current multicore processors, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and the coherence protocol used. They are compared with two or three recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== AMD ===&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; &lt;br /&gt;
!Processor&lt;br /&gt;
!L1 Cache&lt;br /&gt;
!L2 Cache&lt;br /&gt;
!L3 Cache&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
!Opteron Server/Workstation&lt;br /&gt;
!64K Data + 64K Instruction per core&lt;br /&gt;
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency&lt;br /&gt;
2-way associative instruction cache&lt;br /&gt;
!1 MB per core, 16-way associative&lt;br /&gt;
!&amp;amp;nbsp;&lt;br /&gt;
!Dual cores, introduced on 4/22/2005&lt;br /&gt;
|-&lt;br /&gt;
!Athlon64 X2 family&lt;br /&gt;
!64K Data + 64K Instruction per core&lt;br /&gt;
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency&lt;br /&gt;
2-way associative instruction cache&lt;br /&gt;
!up to 1 MB per core, 16-way associative &lt;br /&gt;
!&amp;amp;nbsp;&lt;br /&gt;
!Dual cores, introduced on 5/13/2005&lt;br /&gt;
|-&lt;br /&gt;
!Athlon64 FX&lt;br /&gt;
!64K Data + 64K Instruction per core&lt;br /&gt;
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency&lt;br /&gt;
2-way associative instruction cache&lt;br /&gt;
!up to 1 MB per core, 16-way associative&lt;br /&gt;
!&amp;amp;nbsp;&lt;br /&gt;
!High performance desktop&lt;br /&gt;
|-&lt;br /&gt;
!Turion64 X2&lt;br /&gt;
!Total 256KB (128K per core)&lt;br /&gt;
!Total 1MB (512K per core)&lt;br /&gt;
!&amp;amp;nbsp;&lt;br /&gt;
!For laptops&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== ARM ===&lt;br /&gt;
MPCore container for ARM9 &amp;amp; ARM11: high-performance embedded and entertainment&lt;br /&gt;
&lt;br /&gt;
=== Broadcom ===&lt;br /&gt;
* SiByte	SB1250&lt;br /&gt;
Two scalable MIPS cores, MESI coherence&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
* SB1255&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
* SB1455&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
&lt;br /&gt;
L2: 1MB shared, 8way associative, ECC protected&lt;br /&gt;
&lt;br /&gt;
=== Cradle Technology ===&lt;br /&gt;
Multi-core DSPs: CT3400, CT3600&lt;br /&gt;
* CT3400&lt;br /&gt;
8 32-bit DSPs, 6 RISC-like CPUs&lt;br /&gt;
32K instruction cache, 64K local data memory&lt;br /&gt;
&lt;br /&gt;
=== CT3600 ===&lt;br /&gt;
2 quad DSPs, 8 DSPs per quad, 32KB instruction cache per quad, 125K data memory per quad&lt;br /&gt;
&lt;br /&gt;
=== Cavium Networks ===&lt;br /&gt;
* Octeon&lt;br /&gt;
CN38XX/CN36XX: 4 to 16 MIPS64 cores&lt;br /&gt;
1M ECC protected shared L2 (CN38XX), 512K (CN36XX)&lt;br /&gt;
32K instruction cache / 8K data cache / 2K write buffer per MIPS core&lt;br /&gt;
&lt;br /&gt;
=== IBM ===&lt;br /&gt;
* Cell&lt;br /&gt;
In the PlayStation 3&lt;br /&gt;
PowerPC based&lt;br /&gt;
8 cores optimized for vector operation&lt;br /&gt;
Power4 (2000) was the first dual-core design&lt;br /&gt;
&lt;br /&gt;
* Power4&lt;br /&gt;
L1: 64K per CPU instruction cache (128-byte line, direct-mapped, LRU)&lt;br /&gt;
     32K per CPU data cache (2-way, 128-byte line, LRU)&lt;br /&gt;
L2: 1440K, 8way, 128 byte line&lt;br /&gt;
L3: 128M, 8way 512 byte line&lt;br /&gt;
&lt;br /&gt;
* Power4+&lt;br /&gt;
L1: data 32K(2way set), instruction 64K(directly mapped)&lt;br /&gt;
L2: 3 x 0.5M shared by dual core, 40B/cycle per port&lt;br /&gt;
L3: 32M&lt;br /&gt;
&lt;br /&gt;
* Power5&lt;br /&gt;
Dual core&lt;br /&gt;
L1: I-64K(2way LRU), D-32K(4way, LRU)&lt;br /&gt;
L2: three 0.625M, 10way, LRU, 128 byte line&lt;br /&gt;
L3: 36M, 12way&lt;br /&gt;
&lt;br /&gt;
* PowerPC 970MP (dual core)&lt;br /&gt;
In the Apple PowerMac&lt;br /&gt;
L1: 32K data(2 way associative with parity protection), 64K instruction (directly mapped)&lt;br /&gt;
&lt;br /&gt;
L2: 1M per core, ECC&lt;br /&gt;
Cache-coherency snooping protocol&lt;br /&gt;
&lt;br /&gt;
=== Intel ===&lt;br /&gt;
* Core2 Quad&lt;br /&gt;
&lt;br /&gt;
* Xeon Quad core&lt;br /&gt;
Introduced on 12/13/2006&lt;br /&gt;
&lt;br /&gt;
* Xeon 5000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µops (trace cache per core)&lt;br /&gt;
L2: 2MB per core&lt;br /&gt;
&lt;br /&gt;
* Xeon MP7000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µops (trace cache per core)&lt;br /&gt;
L2: 1MB per core or 2x2MB per core&lt;br /&gt;
&lt;br /&gt;
* Xeon MP7300&lt;br /&gt;
L1: 32K data per core, 32K instruction per core&lt;br /&gt;
L2: up to 8MB, snoop filter, 4M shared L2 per die&lt;br /&gt;
&lt;br /&gt;
* Core Duo&lt;br /&gt;
* Core2 Duo&lt;br /&gt;
* Xeon (x1xx): dual core&lt;br /&gt;
&lt;br /&gt;
* Xeon 5100&lt;br /&gt;
L1 : 32KB (Data) + 32KB (Instruction) per core&lt;br /&gt;
L2 : 4MB (Shared) &lt;br /&gt;
&lt;br /&gt;
* Xeon 7100&lt;br /&gt;
L1: 16K data&lt;br /&gt;
L2: 2M 8way ECC&lt;br /&gt;
L3: up to 16M, 16-way, ECC&lt;br /&gt;
&lt;br /&gt;
* Itanium 2&lt;br /&gt;
Dual-core (Montecito)&lt;br /&gt;
L1: 32KB&lt;br /&gt;
L2: 256KB&lt;br /&gt;
L3: up to 9MB&lt;br /&gt;
&lt;br /&gt;
=== PARISC ===&lt;br /&gt;
* PA8800&lt;br /&gt;
L1: 1.5M data (4-way set associative), 1.5M instruction (4-way set associative), 2-cycle latency&lt;br /&gt;
L2: 32M &lt;br /&gt;
&lt;br /&gt;
=== Stream Processors ===&lt;br /&gt;
* Storm-1 family&lt;br /&gt;
40 to 80 ALUs&lt;br /&gt;
&lt;br /&gt;
SP16HP-G220, SP16-G160, SP8-G80&lt;br /&gt;
L1: 32K data, 16K instruction&lt;br /&gt;
96K VLIW instruction memory&lt;br /&gt;
&lt;br /&gt;
=== Sun Microsystems ===&lt;br /&gt;
* UltraSPARC IV&lt;br /&gt;
dual&lt;br /&gt;
L1: 64K data, 32K instruction&lt;br /&gt;
L2: up to 16MB, external 8M 2way set associative per core&lt;br /&gt;
Cache line size changed from 512 to 128 bytes to reduce data contention associated with sub-blocked caches; LRU replacement policy, ECC&lt;br /&gt;
write cache: hash-indexed 2K,&lt;br /&gt;
2K prefetch cache&lt;br /&gt;
&lt;br /&gt;
* UltraSPARC IV+&lt;br /&gt;
L2: 2M on-chip &lt;br /&gt;
L3: 32M external&lt;br /&gt;
&lt;br /&gt;
* UltraSPARC T1&lt;br /&gt;
8 cores&lt;br /&gt;
L1: 16K instruction cache per core, 4way-set associative parity protected&lt;br /&gt;
8K data cache , parity protected, 4way set associative&lt;br /&gt;
&lt;br /&gt;
L2: 3M 12way, 4banks, ECC&lt;br /&gt;
&lt;br /&gt;
== Cache size and characteristics of single-core processors ==&lt;br /&gt;
&lt;br /&gt;
=== AMD ===&lt;br /&gt;
&lt;br /&gt;
* Athlon 64&lt;br /&gt;
Single core&lt;br /&gt;
L1: 64K data + 64K instruction, 2-way associative&lt;br /&gt;
L2: up to 1MB, 16-way associative&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4418</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4418"/>
		<updated>2007-09-24T23:02:27Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* AMD */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cache size and characteristics of multi-core processors ==&lt;br /&gt;
&lt;br /&gt;
The following tables show the cache characteristics of current multicore processors, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and the coherence protocol used. They are compared with two or three recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== AMD ===&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; &lt;br /&gt;
!Processor&lt;br /&gt;
!L1 Cache&lt;br /&gt;
!L2 Cache&lt;br /&gt;
!L3 Cache&lt;br /&gt;
!Description&lt;br /&gt;
|-&lt;br /&gt;
!Opteron Server/Workstation&lt;br /&gt;
!64K Data + 64K Instruction per core&lt;br /&gt;
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency&lt;br /&gt;
2-way associative instruction cache&lt;br /&gt;
!1 MB per core, 16-way associative&lt;br /&gt;
!&lt;br /&gt;
!Dual cores, introduced on 4/22/2005&lt;br /&gt;
|-&lt;br /&gt;
!Athlon64 X2 family&lt;br /&gt;
!64K Data + 64K Instruction per core&lt;br /&gt;
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency&lt;br /&gt;
2-way associative instruction cache&lt;br /&gt;
!up to 1 MB per core, 16-way associative &lt;br /&gt;
!&lt;br /&gt;
!Dual cores, introduced on 5/13/2005&lt;br /&gt;
|-&lt;br /&gt;
!Athlon64 FX&lt;br /&gt;
!64K Data + 64K Instruction per core&lt;br /&gt;
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency&lt;br /&gt;
2-way associative instruction cache&lt;br /&gt;
!up to 1 MB per core, 16-way associative&lt;br /&gt;
!&lt;br /&gt;
!High performance desktop&lt;br /&gt;
|-&lt;br /&gt;
!Turion64 X2&lt;br /&gt;
!Total 256KB (128K per core)&lt;br /&gt;
!Total 1MB (512K per core)&lt;br /&gt;
!&lt;br /&gt;
!For laptops&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Broadcom ===&lt;br /&gt;
* SiByte	SB1250&lt;br /&gt;
Two scalable MIPS cores, MESI coherence&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
* SB1255&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
* SB1455&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
&lt;br /&gt;
L2: 1MB shared, 8way associative, ECC protected&lt;br /&gt;
&lt;br /&gt;
=== Cradle Technology ===&lt;br /&gt;
Multi-core DSPs: CT3400, CT3600&lt;br /&gt;
* CT3400&lt;br /&gt;
8 32-bit DSPs, 6 RISC-like CPUs&lt;br /&gt;
32K instruction cache, 64K local data memory&lt;br /&gt;
&lt;br /&gt;
=== CT3600 ===&lt;br /&gt;
2 quad DSPs, 8 DSPs per quad, 32KB instruction cache per quad, 125K data memory per quad&lt;br /&gt;
&lt;br /&gt;
=== Cavium Networks ===&lt;br /&gt;
* Octeon&lt;br /&gt;
CN38XX/CN36XX: 4 to 16 MIPS64 cores&lt;br /&gt;
1M ECC protected shared L2 (CN38XX), 512K (CN36XX)&lt;br /&gt;
32K instruction cache / 8K data cache / 2K write buffer per MIPS core&lt;br /&gt;
&lt;br /&gt;
=== IBM ===&lt;br /&gt;
* Cell&lt;br /&gt;
In the PlayStation 3&lt;br /&gt;
PowerPC based&lt;br /&gt;
8 cores optimized for vector operation&lt;br /&gt;
Power4 (2000) was the first dual-core design&lt;br /&gt;
&lt;br /&gt;
* Power4&lt;br /&gt;
L1: 64K per CPU instruction cache (128-byte line, direct-mapped, LRU)&lt;br /&gt;
     32K per CPU data cache (2-way, 128-byte line, LRU)&lt;br /&gt;
L2: 1440K, 8way, 128 byte line&lt;br /&gt;
L3: 128M, 8way 512 byte line&lt;br /&gt;
&lt;br /&gt;
* Power4+&lt;br /&gt;
L1: data 32K(2way set), instruction 64K(directly mapped)&lt;br /&gt;
L2: 3 x 0.5M shared by dual core, 40B/cycle per port&lt;br /&gt;
L3: 32M&lt;br /&gt;
&lt;br /&gt;
* Power5&lt;br /&gt;
Dual core&lt;br /&gt;
L1: I-64K(2way LRU), D-32K(4way, LRU)&lt;br /&gt;
L2: three 0.625M, 10way, LRU, 128 byte line&lt;br /&gt;
L3: 36M, 12way&lt;br /&gt;
&lt;br /&gt;
* PowerPC 970MP (dual core)&lt;br /&gt;
In the Apple PowerMac&lt;br /&gt;
L1: 32K data(2 way associative with parity protection), 64K instruction (directly mapped)&lt;br /&gt;
&lt;br /&gt;
L2: 1M per core, ECC&lt;br /&gt;
Cache-coherency snooping protocol&lt;br /&gt;
&lt;br /&gt;
=== Intel ===&lt;br /&gt;
* Core2 Quad&lt;br /&gt;
&lt;br /&gt;
* Xeon Quad core&lt;br /&gt;
Introduced on 12/13/2006&lt;br /&gt;
&lt;br /&gt;
* Xeon 5000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µops (trace cache per core)&lt;br /&gt;
L2: 2MB per core&lt;br /&gt;
&lt;br /&gt;
* Xeon MP7000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µops (trace cache per core)&lt;br /&gt;
L2: 1MB per core or 2x2MB per core&lt;br /&gt;
&lt;br /&gt;
* Xeon MP7300&lt;br /&gt;
L1: 32K data per core, 32K instruction per core&lt;br /&gt;
L2: up to 8MB, snoop filter, 4M shared L2 per die&lt;br /&gt;
&lt;br /&gt;
* Core Duo&lt;br /&gt;
* Core2 Duo&lt;br /&gt;
* Xeon (x1xx): dual core&lt;br /&gt;
&lt;br /&gt;
* Xeon 5100&lt;br /&gt;
L1 : 32KB (Data) + 32KB (Instruction) per core&lt;br /&gt;
L2 : 4MB (Shared) &lt;br /&gt;
&lt;br /&gt;
* Xeon 7100&lt;br /&gt;
L1: 16K data&lt;br /&gt;
L2: 2M 8way ECC&lt;br /&gt;
L3: up to 16M, 16-way, ECC&lt;br /&gt;
&lt;br /&gt;
* Itanium 2&lt;br /&gt;
Dual-core (Montecito)&lt;br /&gt;
L1: 32KB&lt;br /&gt;
L2: 256KB&lt;br /&gt;
L3: up to 9MB&lt;br /&gt;
&lt;br /&gt;
=== PARISC ===&lt;br /&gt;
* PA8800&lt;br /&gt;
L1: 1.5M data (4-way set associative), 1.5M instruction (4-way set associative), 2-cycle latency&lt;br /&gt;
L2: 32M &lt;br /&gt;
&lt;br /&gt;
=== Stream Processors ===&lt;br /&gt;
* Storm-1 family&lt;br /&gt;
40 to 80 ALUs&lt;br /&gt;
&lt;br /&gt;
SP16HP-G220, SP16-G160, SP8-G80&lt;br /&gt;
L1: 32K data, 16K instruction&lt;br /&gt;
96K VLIW instruction memory&lt;br /&gt;
&lt;br /&gt;
=== Sun Microsystems ===&lt;br /&gt;
* UltraSPARC IV&lt;br /&gt;
dual&lt;br /&gt;
L1: 64K data, 32K instruction&lt;br /&gt;
L2: up to 16MB, external 8M 2way set associative per core&lt;br /&gt;
Cache line size changed from 512 to 128 bytes to reduce data contention associated with sub-blocked caches; LRU replacement policy, ECC&lt;br /&gt;
write cache: hash-indexed 2K,&lt;br /&gt;
2K prefetch cache&lt;br /&gt;
&lt;br /&gt;
* UltraSPARC IV+&lt;br /&gt;
L2: 2M on-chip &lt;br /&gt;
L3: 32M external&lt;br /&gt;
&lt;br /&gt;
* UltraSPARC T1&lt;br /&gt;
8 cores&lt;br /&gt;
L1: 16K instruction cache per core, 4way-set associative parity protected&lt;br /&gt;
8K data cache , parity protected, 4way set associative&lt;br /&gt;
&lt;br /&gt;
L2: 3M 12way, 4banks, ECC&lt;br /&gt;
&lt;br /&gt;
== Cache size and characteristics of single-core processors ==&lt;br /&gt;
&lt;br /&gt;
=== AMD ===&lt;br /&gt;
&lt;br /&gt;
* Athlon 64&lt;br /&gt;
Single core&lt;br /&gt;
L1: 64K data + 64K instruction, 2-way associative&lt;br /&gt;
L2: up to 1MB, 16-way associative&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4403</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4403"/>
		<updated>2007-09-24T22:16:49Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Cache size and characteristics of single-core processors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cache size and characteristics of multi-core processors ==&lt;br /&gt;
&lt;br /&gt;
The following sections show the current trend in multicore processors' caches.&lt;br /&gt;
&lt;br /&gt;
Create a table of caches used in current multicore architectures, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and coherence protocol used. Compare this with two or three recent single-core designs.&lt;br /&gt;
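The parameters above (total size, associativity, line size) fix a cache's geometry. As a quick illustration of how they relate, the sketch below derives the number of sets and the address-bit split; the helper name and the 1 MB / 16way / 64-byte example are illustrative assumptions, not taken from any datasheet listed here.

```python
# Illustrative helper (hypothetical, not from any datasheet above):
# derive set count and address-bit split for a set-associative cache.
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (num_sets, index_bits, offset_bits); sizes assumed powers of two."""
    num_sets = size_bytes // (ways * line_bytes)
    index_bits = num_sets.bit_length() - 1     # log2(num_sets)
    offset_bits = line_bytes.bit_length() - 1  # log2(line_bytes)
    return num_sets, index_bits, offset_bits

# e.g. a 1 MB, 16way cache with 64-byte lines
print(cache_geometry(1024 * 1024, 16, 64))  # (1024, 10, 6)
```

The same arithmetic applies to any of the caches tabulated below once its size, ways, and line size are known.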
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== AMD ===&lt;br /&gt;
* Opteron Server/Workstation&lt;br /&gt;
2 cores, 4/22/2005&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: 1 MB per core, 16way associative &lt;br /&gt;
&lt;br /&gt;
* Athlon 64 X2 family&lt;br /&gt;
2 cores, 5/13/2005&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: up to 1 MB per core, 16way associative &lt;br /&gt;
&lt;br /&gt;
* Athlon 64 FX (high-performance desktop)&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: up to 1 MB per core, 16way associative&lt;br /&gt;
&lt;br /&gt;
* Turion 64 X2 (laptop)&lt;br /&gt;
L1: Total 256KB (128K per core)&lt;br /&gt;
L2: Total 1MB (512K per core)&lt;br /&gt;
&lt;br /&gt;
=== ARM ===&lt;br /&gt;
* MPCore (container for ARM9 &amp;amp; ARM11)&lt;br /&gt;
High-performance embedded and entertainment&lt;br /&gt;
&lt;br /&gt;
=== Broadcom ===&lt;br /&gt;
* SiByte SB1250&lt;br /&gt;
Two scalable MIPS cores, MESI&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
* SB1255&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
* SB1455&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
&lt;br /&gt;
L2: 1MB shared, 8way associative, ECC protected&lt;br /&gt;
&lt;br /&gt;
=== Cradle Technology ===&lt;br /&gt;
Multi-core DSPs&lt;br /&gt;
* CT3400&lt;br /&gt;
eight 32-bit DSPs, 6 RISC-like CPUs&lt;br /&gt;
32K instruction cache, 64K local data memory&lt;br /&gt;
&lt;br /&gt;
* CT3600&lt;br /&gt;
2 quad DSPs, 8 DSPs per quad, 32KB instruction cache per quad, 125K data memory per quad&lt;br /&gt;
&lt;br /&gt;
=== Cavium Networks ===&lt;br /&gt;
* Octeon&lt;br /&gt;
up to 16 MIPS cores&lt;br /&gt;
CN38XX/CN36XX 4 to 16 MIPS64 cores&lt;br /&gt;
1M ECC protected shared L2 (CN38XX), 512K (CN36XX)&lt;br /&gt;
32K instruction cache / 8K data cache / 2K write buffer per MIPS core&lt;br /&gt;
&lt;br /&gt;
=== IBM ===&lt;br /&gt;
* Cell&lt;br /&gt;
In the PlayStation 3&lt;br /&gt;
PowerPC based&lt;br /&gt;
8 cores optimized for vector operation&lt;br /&gt;
&lt;br /&gt;
* Power4 (first dual core, 2000)&lt;br /&gt;
L1: 64K instruction cache per CPU (128 byte line, direct mapped, LRU)&lt;br /&gt;
32K data cache per CPU (2way, 128 byte line, LRU)&lt;br /&gt;
L2: 1440K, 8way, 128 byte line&lt;br /&gt;
L3: 128M, 8way 512 byte line&lt;br /&gt;
&lt;br /&gt;
* Power4+&lt;br /&gt;
L1: data 32K(2way set), instruction 64K(directly mapped)&lt;br /&gt;
L2: 3 x 0.5M shared by dual core, 40B/cycle per port&lt;br /&gt;
L3: 32M&lt;br /&gt;
&lt;br /&gt;
* Power5&lt;br /&gt;
Dual core&lt;br /&gt;
L1: I-64K(2way LRU), D-32K(4way, LRU)&lt;br /&gt;
L2: three 0.625M, 10way, LRU, 128 byte line&lt;br /&gt;
L3: 36M, 12way&lt;br /&gt;
&lt;br /&gt;
* PowerPC 970MP (dual core)&lt;br /&gt;
In the Apple PowerMac&lt;br /&gt;
L1: 32K data(2 way associative with parity protection), 64K instruction (directly mapped)&lt;br /&gt;
&lt;br /&gt;
L2: 1M per core, ECC&lt;br /&gt;
Cache-coherency snooping protocol&lt;br /&gt;
&lt;br /&gt;
=== Intel ===&lt;br /&gt;
* Core2 Quad&lt;br /&gt;
&lt;br /&gt;
* Xeon Quad core&lt;br /&gt;
12/13/2006&lt;br /&gt;
&lt;br /&gt;
* Xeon 5000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µOPS (trace cache per core)&lt;br /&gt;
L2: 2MB per core&lt;br /&gt;
&lt;br /&gt;
* Xeon MP7000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µOPS (trace cache per core)&lt;br /&gt;
L2: 1MB per core or 2x2MB per core&lt;br /&gt;
&lt;br /&gt;
* Xeon MP7300&lt;br /&gt;
L1: 32K data per core, 32K instruction per core&lt;br /&gt;
L2: up to 8MB, snoop filter, 4M shared L2 per die&lt;br /&gt;
* Core Duo&lt;br /&gt;
* Core2 Duo&lt;br /&gt;
* Xeon (x1xx) dual core&lt;br /&gt;
&lt;br /&gt;
* Xeon 5100&lt;br /&gt;
L1 : 32KB (Data) + 32KB (Instruction) per core&lt;br /&gt;
L2 : 4MB (Shared) &lt;br /&gt;
&lt;br /&gt;
* Xeon 7100&lt;br /&gt;
L1: 16K data&lt;br /&gt;
L2: 2M 8way ECC&lt;br /&gt;
L3: up to 16M, ECC, 16way&lt;br /&gt;
&lt;br /&gt;
* Itanium 2&lt;br /&gt;
Multi-core&lt;br /&gt;
Montecito&lt;br /&gt;
L1: 32KB&lt;br /&gt;
L2: 256KB&lt;br /&gt;
L3: up to 9MB&lt;br /&gt;
&lt;br /&gt;
=== PA-RISC ===&lt;br /&gt;
* PA8800&lt;br /&gt;
L1: 1.5M data 4way set, 1.5M instruction 4way set, 2 cycle latency&lt;br /&gt;
L2: 32M &lt;br /&gt;
&lt;br /&gt;
=== Stream Processors ===&lt;br /&gt;
* Storm-1 family&lt;br /&gt;
40 to 80 ALUs&lt;br /&gt;
&lt;br /&gt;
SP16HP-G220, SP16-G160, SP8-G80&lt;br /&gt;
L1: 32K data, 16K instruction&lt;br /&gt;
96K VLIW instruction memory&lt;br /&gt;
&lt;br /&gt;
=== Sun Microsystems ===&lt;br /&gt;
* UltraSPARC IV&lt;br /&gt;
dual&lt;br /&gt;
L1: 64K data, 32K instruction&lt;br /&gt;
L2: up to 16MB, external 8M 2way set associative per core&lt;br /&gt;
Cache line size changed from 512 to 128 bytes to reduce the data contention associated with a sub-blocked cache; LRU replacement policy, ECC &lt;br /&gt;
write cache: hash-indexed 2K,&lt;br /&gt;
2K prefetch cache&lt;br /&gt;
&lt;br /&gt;
* UltraSPARC IV+&lt;br /&gt;
L2: 2M on-chip &lt;br /&gt;
L3: 32M external&lt;br /&gt;
&lt;br /&gt;
* UltraSPARC T1&lt;br /&gt;
8 cores&lt;br /&gt;
L1: 16K instruction cache per core, 4way-set associative parity protected&lt;br /&gt;
8K data cache, parity protected, 4way set associative&lt;br /&gt;
&lt;br /&gt;
L2: 3M 12way, 4 banks, ECC&lt;br /&gt;
&lt;br /&gt;
== Cache size and characteristics of single-core processors ==&lt;br /&gt;
&lt;br /&gt;
=== AMD ===&lt;br /&gt;
&lt;br /&gt;
* Athlon 64&lt;br /&gt;
single core&lt;br /&gt;
L1: 64K data + 64K instruction&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: 512K, 16way associative&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4402</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4402"/>
		<updated>2007-09-24T22:16:20Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Cache size and characteristics of multi-core processors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cache size and characteristics of multi-core processors ==&lt;br /&gt;
&lt;br /&gt;
The following sections show the current trend in multicore processors' caches.&lt;br /&gt;
&lt;br /&gt;
Create a table of caches used in current multicore architectures, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and coherence protocol used. Compare this with two or three recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== AMD ===&lt;br /&gt;
* Opteron Server/Workstation&lt;br /&gt;
2 cores, 4/22/2005&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: 1 MB per core, 16way associative &lt;br /&gt;
&lt;br /&gt;
* Athlon 64 X2 family&lt;br /&gt;
2 cores, 5/13/2005&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: up to 1 MB per core, 16way associative &lt;br /&gt;
&lt;br /&gt;
* Athlon 64 FX (high-performance desktop)&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: up to 1 MB per core, 16way associative&lt;br /&gt;
&lt;br /&gt;
* Turion 64 X2 (laptop)&lt;br /&gt;
L1: Total 256KB (128K per core)&lt;br /&gt;
L2: Total 1MB (512K per core)&lt;br /&gt;
&lt;br /&gt;
=== ARM ===&lt;br /&gt;
* MPCore (container for ARM9 &amp;amp; ARM11)&lt;br /&gt;
High-performance embedded and entertainment&lt;br /&gt;
&lt;br /&gt;
=== Broadcom ===&lt;br /&gt;
* SiByte SB1250&lt;br /&gt;
Two scalable MIPS cores, MESI&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
* SB1255&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
* SB1455&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
&lt;br /&gt;
L2: 1MB shared, 8way associative, ECC protected&lt;br /&gt;
&lt;br /&gt;
=== Cradle Technology ===&lt;br /&gt;
Multi-core DSPs&lt;br /&gt;
* CT3400&lt;br /&gt;
eight 32-bit DSPs, 6 RISC-like CPUs&lt;br /&gt;
32K instruction cache, 64K local data memory&lt;br /&gt;
&lt;br /&gt;
* CT3600&lt;br /&gt;
2 quad DSPs, 8 DSPs per quad, 32KB instruction cache per quad, 125K data memory per quad&lt;br /&gt;
&lt;br /&gt;
=== Cavium Networks ===&lt;br /&gt;
* Octeon&lt;br /&gt;
up to 16 MIPS cores&lt;br /&gt;
CN38XX/CN36XX 4 to 16 MIPS64 cores&lt;br /&gt;
1M ECC protected shared L2 (CN38XX), 512K (CN36XX)&lt;br /&gt;
32K instruction cache / 8K data cache / 2K write buffer per MIPS core&lt;br /&gt;
&lt;br /&gt;
=== IBM ===&lt;br /&gt;
* Cell&lt;br /&gt;
In the PlayStation 3&lt;br /&gt;
PowerPC based&lt;br /&gt;
8 cores optimized for vector operation&lt;br /&gt;
&lt;br /&gt;
* Power4 (first dual core, 2000)&lt;br /&gt;
L1: 64K instruction cache per CPU (128 byte line, direct mapped, LRU)&lt;br /&gt;
32K data cache per CPU (2way, 128 byte line, LRU)&lt;br /&gt;
L2: 1440K, 8way, 128 byte line&lt;br /&gt;
L3: 128M, 8way 512 byte line&lt;br /&gt;
&lt;br /&gt;
* Power4+&lt;br /&gt;
L1: data 32K(2way set), instruction 64K(directly mapped)&lt;br /&gt;
L2: 3 x 0.5M shared by dual core, 40B/cycle per port&lt;br /&gt;
L3: 32M&lt;br /&gt;
&lt;br /&gt;
* Power5&lt;br /&gt;
Dual core&lt;br /&gt;
L1: I-64K(2way LRU), D-32K(4way, LRU)&lt;br /&gt;
L2: three 0.625M, 10way, LRU, 128 byte line&lt;br /&gt;
L3: 36M, 12way&lt;br /&gt;
&lt;br /&gt;
* PowerPC 970MP (dual core)&lt;br /&gt;
In the Apple PowerMac&lt;br /&gt;
L1: 32K data(2 way associative with parity protection), 64K instruction (directly mapped)&lt;br /&gt;
&lt;br /&gt;
L2: 1M per core, ECC&lt;br /&gt;
Cache-coherency snooping protocol&lt;br /&gt;
&lt;br /&gt;
=== Intel ===&lt;br /&gt;
* Core2 Quad&lt;br /&gt;
&lt;br /&gt;
* Xeon Quad core&lt;br /&gt;
12/13/2006&lt;br /&gt;
&lt;br /&gt;
* Xeon 5000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µOPS (trace cache per core)&lt;br /&gt;
L2: 2MB per core&lt;br /&gt;
&lt;br /&gt;
* Xeon MP7000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µOPS (trace cache per core)&lt;br /&gt;
L2: 1MB per core or 2x2MB per core&lt;br /&gt;
&lt;br /&gt;
* Xeon MP7300&lt;br /&gt;
L1: 32K data per core, 32K instruction per core&lt;br /&gt;
L2: up to 8MB, snoop filter, 4M shared L2 per die&lt;br /&gt;
* Core Duo&lt;br /&gt;
* Core2 Duo&lt;br /&gt;
* Xeon (x1xx) dual core&lt;br /&gt;
&lt;br /&gt;
* Xeon 5100&lt;br /&gt;
L1 : 32KB (Data) + 32KB (Instruction) per core&lt;br /&gt;
L2 : 4MB (Shared) &lt;br /&gt;
&lt;br /&gt;
* Xeon 7100&lt;br /&gt;
L1: 16K data&lt;br /&gt;
L2: 2M 8way ECC&lt;br /&gt;
L3: up to 16M, ECC, 16way&lt;br /&gt;
&lt;br /&gt;
* Itanium 2&lt;br /&gt;
Multi-core&lt;br /&gt;
Montecito&lt;br /&gt;
L1: 32KB&lt;br /&gt;
L2: 256KB&lt;br /&gt;
L3: up to 9MB&lt;br /&gt;
&lt;br /&gt;
=== PA-RISC ===&lt;br /&gt;
* PA8800&lt;br /&gt;
L1: 1.5M data 4way set, 1.5M instruction 4way set, 2 cycle latency&lt;br /&gt;
L2: 32M &lt;br /&gt;
&lt;br /&gt;
=== Stream Processors ===&lt;br /&gt;
* Storm-1 family&lt;br /&gt;
40 to 80 ALUs&lt;br /&gt;
&lt;br /&gt;
SP16HP-G220, SP16-G160, SP8-G80&lt;br /&gt;
L1: 32K data, 16K instruction&lt;br /&gt;
96K VLIW instruction memory&lt;br /&gt;
&lt;br /&gt;
=== Sun Microsystems ===&lt;br /&gt;
* UltraSPARC IV&lt;br /&gt;
dual&lt;br /&gt;
L1: 64K data, 32K instruction&lt;br /&gt;
L2: up to 16MB, external 8M 2way set associative per core&lt;br /&gt;
Cache line size changed from 512 to 128 bytes to reduce the data contention associated with a sub-blocked cache; LRU replacement policy, ECC &lt;br /&gt;
write cache: hash-indexed 2K,&lt;br /&gt;
2K prefetch cache&lt;br /&gt;
&lt;br /&gt;
* UltraSPARC IV+&lt;br /&gt;
L2: 2M on-chip &lt;br /&gt;
L3: 32M external&lt;br /&gt;
&lt;br /&gt;
* UltraSPARC T1&lt;br /&gt;
8 cores&lt;br /&gt;
L1: 16K instruction cache per core, 4way-set associative parity protected&lt;br /&gt;
8K data cache, parity protected, 4way set associative&lt;br /&gt;
&lt;br /&gt;
L2: 3M 12way, 4 banks, ECC&lt;br /&gt;
&lt;br /&gt;
== Cache size and characteristics of single-core processors ==&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4401</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4401"/>
		<updated>2007-09-24T22:10:28Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Reference */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cache size and characteristics of multi-core processors ==&lt;br /&gt;
&lt;br /&gt;
Wiki: Cache sizes in multicore architectures &lt;br /&gt;
&lt;br /&gt;
Create a table of caches used in current multicore architectures, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and coherence protocol used. Compare this with two or three recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
AMD	Opteron	Server/workstation&lt;br /&gt;
2 cores, 4/22/2005&lt;br /&gt;
&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: 1 MB per core, 16way associative &lt;br /&gt;
&lt;br /&gt;
	Athlon 64 X2 family	2 cores&lt;br /&gt;
5/13/2005&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: up to 1 MB per core, 16way associative &lt;br /&gt;
	Athlon 64 FX	High performance desktop&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: up to 1 MB per core, 16way associative&lt;br /&gt;
	Athlon64 ###	L1: 64K Data + 64K Instr.&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: 512K&lt;br /&gt;
16way associative&lt;br /&gt;
	Turion 64 X2	Laptop&lt;br /&gt;
L1: Total 256KB (128K per core)&lt;br /&gt;
L2: Total 1MB (512K per core)&lt;br /&gt;
ARM	MPCore container for ARM9 &amp;amp; ARM11	High-performance embedded and entertainment&lt;br /&gt;
&lt;br /&gt;
Broadcom	SiByte	SB1250&lt;br /&gt;
Two scalable MIPS cores, MESI&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
SB1255&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
SB1455&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
&lt;br /&gt;
L2: 1MB shared, 8way associative, ECC protected&lt;br /&gt;
Cradle Technology	CT3400 / CT3600	Multi-core DSPs&lt;br /&gt;
&lt;br /&gt;
CT3400&lt;br /&gt;
eight 32-bit DSPs, 6 RISC-like CPUs&lt;br /&gt;
32K instruction cache, 64K local data memory&lt;br /&gt;
&lt;br /&gt;
CT3600&lt;br /&gt;
2 quad DSPs, 8 DSPs per quad, 32KB instruction cache per quad, 125K data memory per quad&lt;br /&gt;
&lt;br /&gt;
Cavium Networks	Octeon	16 MIPS cores&lt;br /&gt;
CN38XX/CN36XX 4 to 16 MIPS64 cores&lt;br /&gt;
1M ECC protected shared L2 (CN38XX), 512K (CN36XX)&lt;br /&gt;
32K instruction cache / 8K data cache / 2K write buffer per MIPS core&lt;br /&gt;
IBM	Cell	In the PlayStation 3&lt;br /&gt;
PowerPC based&lt;br /&gt;
8 cores optimized for vector operation&lt;br /&gt;
	Power4	1st dual core&lt;br /&gt;
2000&lt;br /&gt;
&lt;br /&gt;
Power4&lt;br /&gt;
L1: 64K instruction cache per CPU (128 byte line, direct mapped, LRU)&lt;br /&gt;
32K data cache per CPU (2way, 128 byte line, LRU)&lt;br /&gt;
L2: 1440K, 8way, 128 byte line&lt;br /&gt;
L3: 128M, 8way 512 byte line&lt;br /&gt;
&lt;br /&gt;
Power4+&lt;br /&gt;
L1: data 32K(2way set), instruction 64K(directly mapped)&lt;br /&gt;
L2: 3 x 0.5M shared by dual core, 40B/cycle per port&lt;br /&gt;
L3: 32M&lt;br /&gt;
	Power5	Dual core&lt;br /&gt;
&lt;br /&gt;
L1: I-64K(2way LRU), D-32K(4way, LRU)&lt;br /&gt;
L2: three 0.625M, 10way, LRU, 128 byte line&lt;br /&gt;
L3: 36M, 12way&lt;br /&gt;
	PowerPC 970MP	Dual&lt;br /&gt;
In the Apple PowerMac&lt;br /&gt;
L1: 32K data(2 way associative with parity protection), 64K instruction (directly mapped)&lt;br /&gt;
&lt;br /&gt;
L2: 1M per core, ECC&lt;br /&gt;
Cache-coherency snooping protocol&lt;br /&gt;
Intel	Core2 Quad&lt;br /&gt;
Xeon	Quad core&lt;br /&gt;
12/13/2006&lt;br /&gt;
&lt;br /&gt;
Xeon 5000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µOPS (trace cache per core)&lt;br /&gt;
L2: 2MB per core&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Xeon MP7000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µOPS (trace cache per core)&lt;br /&gt;
L2: 1MB per core or 2x2MB per core&lt;br /&gt;
&lt;br /&gt;
Xeon MP7300&lt;br /&gt;
L1: 32K data per core, 32K instruction per core&lt;br /&gt;
L2: up to 8MB, snoop filter, 4M shared L2 per die&lt;br /&gt;
	Core Duo&lt;br /&gt;
Core2 Duo&lt;br /&gt;
Xeon (x1xx)	Dual core&lt;br /&gt;
&lt;br /&gt;
Xeon 5100&lt;br /&gt;
L1 : 32KB (Data) + 32KB (Instruction) per core&lt;br /&gt;
L2 : 4MB (Shared) &lt;br /&gt;
&lt;br /&gt;
Xeon 7100&lt;br /&gt;
L1: 16K data&lt;br /&gt;
L2: 2M 8way ECC&lt;br /&gt;
L3: up to 16M, ECC, 16way&lt;br /&gt;
&lt;br /&gt;
	Itanium 2	Multi-core&lt;br /&gt;
Montecito&lt;br /&gt;
&lt;br /&gt;
Itanium2&lt;br /&gt;
L1: 32KB&lt;br /&gt;
L2: 256KB&lt;br /&gt;
L3: up to 9MB&lt;br /&gt;
PA-RISC	PA8800&lt;br /&gt;
L1: 1.5M data 4way set, 1.5M instruction 4way set, 2 cycle latency&lt;br /&gt;
L2: 32M &lt;br /&gt;
&lt;br /&gt;
Stream Processors	Storm-1 family	40 to 80 ALUs&lt;br /&gt;
&lt;br /&gt;
SP16HP-G220, SP16-G160, SP8-G80&lt;br /&gt;
L1: 32K data, 16K instruction&lt;br /&gt;
&lt;br /&gt;
96K VLIW instruction memory&lt;br /&gt;
&lt;br /&gt;
Sun Microsystems	UltraSPARC IV / UltraSPARC IV+&lt;br /&gt;
&lt;br /&gt;
UltraSPARC IV&lt;br /&gt;
dual&lt;br /&gt;
L1: 64K data, 32K instruction&lt;br /&gt;
L2: up to 16MB, external 8M 2way set associative per core&lt;br /&gt;
Cache line size changed from 512 to 128 bytes to reduce the data contention associated with a sub-blocked cache; LRU replacement policy, ECC &lt;br /&gt;
write cache: hash-indexed 2K,&lt;br /&gt;
2K prefetch cache&lt;br /&gt;
&lt;br /&gt;
UltraSPARC IV+&lt;br /&gt;
L2: 2M on-chip &lt;br /&gt;
L3: 32M external&lt;br /&gt;
&lt;br /&gt;
	UltraSPARC T1	8 cores&lt;br /&gt;
L1: 16K instruction cache per core, 4way-set associative parity protected&lt;br /&gt;
8K data cache, parity protected, 4way set associative&lt;br /&gt;
&lt;br /&gt;
L2: 3M 12way, 4 banks, ECC&lt;br /&gt;
&lt;br /&gt;
== Cache size and characteristics of single-core processors ==&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4400</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4400"/>
		<updated>2007-09-24T22:10:20Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Reference */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cache size and characteristics of multi-core processors ==&lt;br /&gt;
&lt;br /&gt;
Wiki: Cache sizes in multicore architectures &lt;br /&gt;
&lt;br /&gt;
Create a table of caches used in current multicore architectures, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and coherence protocol used. Compare this with two or three recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
AMD	Opteron	Server/workstation&lt;br /&gt;
2 cores, 4/22/2005&lt;br /&gt;
&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: 1 MB per core, 16way associative &lt;br /&gt;
&lt;br /&gt;
	Athlon 64 X2 family	2 cores&lt;br /&gt;
5/13/2005&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: up to 1 MB per core, 16way associative &lt;br /&gt;
	Athlon 64 FX	High performance desktop&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: up to 1 MB per core, 16way associative&lt;br /&gt;
	Athlon64 ###	L1: 64K Data + 64K Instr.&lt;br /&gt;
2way associative data cache, two 64-bit operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache&lt;br /&gt;
L2: 512K&lt;br /&gt;
16way associative&lt;br /&gt;
	Turion 64 X2	Laptop&lt;br /&gt;
L1: Total 256KB (128K per core)&lt;br /&gt;
L2: Total 1MB (512K per core)&lt;br /&gt;
ARM	MPCore container for ARM9 &amp;amp; ARM11	High-performance embedded and entertainment&lt;br /&gt;
&lt;br /&gt;
Broadcom	SiByte	SB1250&lt;br /&gt;
Two scalable MIPS cores, MESI&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
SB1255&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
SB1455&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
&lt;br /&gt;
L2: 1MB shared, 8way associative, ECC protected&lt;br /&gt;
Cradle Technology	CT3400 / CT3600	Multi-core DSPs&lt;br /&gt;
&lt;br /&gt;
CT3400&lt;br /&gt;
eight 32-bit DSPs, 6 RISC-like CPUs&lt;br /&gt;
32K instruction cache, 64K local data memory&lt;br /&gt;
&lt;br /&gt;
CT3600&lt;br /&gt;
2 quad DSPs, 8 DSPs per quad, 32KB instruction cache per quad, 125K data memory per quad&lt;br /&gt;
&lt;br /&gt;
Cavium Networks	Octeon	16 MIPS cores&lt;br /&gt;
CN38XX/CN36XX 4 to 16 MIPS64 cores&lt;br /&gt;
1M ECC protected shared L2 (CN38XX), 512K (CN36XX)&lt;br /&gt;
32K instruction cache / 8K data cache / 2K write buffer per MIPS core&lt;br /&gt;
IBM	Cell	In the PlayStation 3&lt;br /&gt;
PowerPC based&lt;br /&gt;
8 cores optimized for vector operation&lt;br /&gt;
	Power4	1st dual core&lt;br /&gt;
2000&lt;br /&gt;
&lt;br /&gt;
Power4&lt;br /&gt;
L1: 64K instruction cache per CPU (128 byte line, direct mapped, LRU)&lt;br /&gt;
32K data cache per CPU (2way, 128 byte line, LRU)&lt;br /&gt;
L2: 1440K, 8way, 128 byte line&lt;br /&gt;
L3: 128M, 8way 512 byte line&lt;br /&gt;
&lt;br /&gt;
Power4+&lt;br /&gt;
L1: data 32K(2way set), instruction 64K(directly mapped)&lt;br /&gt;
L2: 3 x 0.5M shared by dual core, 40B/cycle per port&lt;br /&gt;
L3: 32M&lt;br /&gt;
	Power5	Dual core&lt;br /&gt;
&lt;br /&gt;
L1: I-64K(2way LRU), D-32K(4way, LRU)&lt;br /&gt;
L2: three 0.625M, 10way, LRU, 128 byte line&lt;br /&gt;
L3: 36M, 12way&lt;br /&gt;
	PowerPC 970MP	Dual&lt;br /&gt;
In the Apple PowerMac&lt;br /&gt;
L1: 32K data(2 way associative with parity protection), 64K instruction (directly mapped)&lt;br /&gt;
&lt;br /&gt;
L2: 1M per core, ECC&lt;br /&gt;
Cache-coherency snooping protocol&lt;br /&gt;
Intel	Core2 Quad&lt;br /&gt;
Xeon	Quad core&lt;br /&gt;
12/13/2006&lt;br /&gt;
&lt;br /&gt;
Xeon 5000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µOPS (trace cache per core)&lt;br /&gt;
L2: 2MB per core&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Xeon MP7000&lt;br /&gt;
L1: 16KB (data cache per core) + 12K µOPS (trace cache per core)&lt;br /&gt;
L2: 1MB per core or 2x2MB per core&lt;br /&gt;
&lt;br /&gt;
Xeon MP7300&lt;br /&gt;
L1: 32K data per core, 32K instruction per core&lt;br /&gt;
L2: up to 8MB, snoop filter, 4M shared L2 per die&lt;br /&gt;
	Core Duo&lt;br /&gt;
Core2 Duo&lt;br /&gt;
Xeon (x1xx)	Dual core&lt;br /&gt;
&lt;br /&gt;
Xeon 5100&lt;br /&gt;
L1 : 32KB (Data) + 32KB (Instruction) per core&lt;br /&gt;
L2 : 4MB (Shared) &lt;br /&gt;
&lt;br /&gt;
Xeon 7100&lt;br /&gt;
L1: 16K data&lt;br /&gt;
L2: 2M 8way ECC&lt;br /&gt;
L3: up to 16M, ECC, 16way&lt;br /&gt;
&lt;br /&gt;
	Itanium 2	Multi-core&lt;br /&gt;
Montecito&lt;br /&gt;
&lt;br /&gt;
Itanium2&lt;br /&gt;
L1: 32KB&lt;br /&gt;
L2: 256KB&lt;br /&gt;
L3: up to 9MB&lt;br /&gt;
PARISC	PA8800	PA8800&lt;br /&gt;
L1: 1.5M data 4way set, 1.5M instruction 4way set, 2cycle&lt;br /&gt;
L2: 32M &lt;br /&gt;
&lt;br /&gt;
Stream Processor	Storm-1 family	40 to 80 ALUs&lt;br /&gt;
&lt;br /&gt;
SP16HP-G220, SP16-G160, SP8-G80&lt;br /&gt;
L1: 32K data, 16K instruction&lt;br /&gt;
&lt;br /&gt;
96K VLIW instruction memory&lt;br /&gt;
&lt;br /&gt;
Sun Microsystems	UltraSPARC IV&lt;br /&gt;
UltraSPARC IV+	UltraSPARC IV&lt;br /&gt;
dual&lt;br /&gt;
L1: 64K data, 32K instruction&lt;br /&gt;
L2: up to 16MB, external 8M 2way set associative per core&lt;br /&gt;
Cache line size reduced from 512 to 128 bytes to reduce data contention associated with sub-blocked caches; LRU replacement policy, ECC &lt;br /&gt;
write cache: hash-indexed 2K,&lt;br /&gt;
2K prefetch cache&lt;br /&gt;
&lt;br /&gt;
UltraSPARC IV+&lt;br /&gt;
L2: 2M on-chip &lt;br /&gt;
L3: 32M external&lt;br /&gt;
&lt;br /&gt;
	UltraSPARC T1	8 cores&lt;br /&gt;
L1: 16K instruction cache per core, 4way-set associative parity protected&lt;br /&gt;
8K data cache , parity protected, 4way set associative&lt;br /&gt;
&lt;br /&gt;
L2: 3M 12way, 4banks, ECC&lt;br /&gt;
&lt;br /&gt;
== Cache size and characteristics of single-core processors ==&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
== Reference ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4399</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4399"/>
		<updated>2007-09-24T22:10:01Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Reference */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cache size and characteristics of multi-core processors ==&lt;br /&gt;
&lt;br /&gt;
Wiki: Cache sizes in multicore architectures &lt;br /&gt;
&lt;br /&gt;
Create a table of caches used in current multicore architectures, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and coherence protocol used. Compare this with two or three recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
AMD	Opteron	Server/workstation&lt;br /&gt;
2 cores, 4/22/2005&lt;br /&gt;
&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64bits operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache, &lt;br /&gt;
L2: 1 MB per core, 16way associative &lt;br /&gt;
&lt;br /&gt;
	Athlon 64 X2 family	2 cores&lt;br /&gt;
5/13/2005&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64bits operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache, &lt;br /&gt;
L2: up to 1 MB per core, 16way associative &lt;br /&gt;
	Athlon 64 FX	High performance desktop&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64bits operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache, &lt;br /&gt;
L2: up to 1 MB per core, 16way associative&lt;br /&gt;
	Athlon64 ###	L1: 64K Data + 64K Instr.&lt;br /&gt;
2way associative data cache, two 64bits operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache, &lt;br /&gt;
L2 : 512K&lt;br /&gt;
16 way associative&lt;br /&gt;
	Turion 64 X2	Laptop&lt;br /&gt;
L1: Total 256KB (128K per core)&lt;br /&gt;
L2: Total 1MB (512K per core)&lt;br /&gt;
ARM	MPCore container for ARM9 &amp;amp; ARM11	High-performance embedded and entertainment&lt;br /&gt;
&lt;br /&gt;
Broadcom	SiByte	SB1250&lt;br /&gt;
Two scalable MIPS cores, MESI&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
SB1255&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
SB1455&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
&lt;br /&gt;
L2: 1MB shared, 8way associative, ECC protected&lt;br /&gt;
Cradle Technology	CT3400&lt;br /&gt;
CT3600	Multi-core DSP&lt;br /&gt;
CT3400&lt;br /&gt;
Eight 32-bit DSPs, six RISC-like CPUs&lt;br /&gt;
32K instruction cache, 64K local data memory&lt;br /&gt;
&lt;br /&gt;
CT3600&lt;br /&gt;
2 quad DSPs, 8 DSPs per quad, 32KB instruction cache per quad, 125K data memory per quad&lt;br /&gt;
&lt;br /&gt;
Cavium Networks	Octeon	16 MIPS cores&lt;br /&gt;
CN38XX/CN36XX 4 to 16 MIPS64 cores&lt;br /&gt;
1M ECC protected shared L2 (CN38XX), 512K (CN36XX)&lt;br /&gt;
32K instruction cache/8K data cache/2K write buffer per every MIPS core&lt;br /&gt;
IBM	Cell	In the PlayStation 3&lt;br /&gt;
PowerPC based&lt;br /&gt;
8 cores optimized for vector operation&lt;br /&gt;
	Power4	1st dual core&lt;br /&gt;
2000&lt;br /&gt;
&lt;br /&gt;
Power4&lt;br /&gt;
L1: 64K per CPU instruction cache (128 byte line, direct mapped, LRU)&lt;br /&gt;
     32K per CPU data cache (2way, 128byte line, LRU)&lt;br /&gt;
L2: 1440K, 8way, 128 byte line&lt;br /&gt;
L3: 128M, 8way 512 byte line&lt;br /&gt;
&lt;br /&gt;
Power4+&lt;br /&gt;
L1: data 32K(2way set), instruction 64K(directly mapped)&lt;br /&gt;
L2: 3 x 0.5M shared by dual core, 40B/cycle per port&lt;br /&gt;
L3: 32M&lt;br /&gt;
	Power5	Dual core&lt;br /&gt;
&lt;br /&gt;
L1: I-64K(2way LRU), D-32K(4way, LRU)&lt;br /&gt;
L2: three 0.625M, 10way, LRU, 128 byte line&lt;br /&gt;
L3: 36M, 12way&lt;br /&gt;
	PowerPC 970MP	Dual&lt;br /&gt;
In the Apple PowerMac&lt;br /&gt;
L1: 32K data(2 way associative with parity protection), 64K instruction (directly mapped)&lt;br /&gt;
&lt;br /&gt;
L2: 1M per core, ECC&lt;br /&gt;
Cache-coherency snooping protocol&lt;br /&gt;
Intel	Core2 Quad&lt;br /&gt;
Xeon	Quad core&lt;br /&gt;
12/13/2006&lt;br /&gt;
&lt;br /&gt;
Xeon 5000&lt;br /&gt;
L1: 16KB (Data cache per core) + 12K µOPS (Trace cache per core)&lt;br /&gt;
L2: 2MB per core&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Xeon MP7000&lt;br /&gt;
L1:16KB (Data cache per core)+ 12K µOPS (Trace cache per core)&lt;br /&gt;
L2: 1MB per core or 2x2MB per core&lt;br /&gt;
&lt;br /&gt;
Xeon MP7300&lt;br /&gt;
L1: 32K data per core, 32K instruction per core&lt;br /&gt;
L2: up to 8MB, snoop filter, 4M shared L2 per die&lt;br /&gt;
	Core Duo&lt;br /&gt;
Core2 Duo&lt;br /&gt;
Xeon (x1xx)	Dual core&lt;br /&gt;
&lt;br /&gt;
Xeon 5100&lt;br /&gt;
L1 : 32KB (Data) + 32KB (Instruction) per core&lt;br /&gt;
L2 : 4MB (Shared) &lt;br /&gt;
&lt;br /&gt;
Xeon 7100&lt;br /&gt;
L1: 16K data&lt;br /&gt;
L2: 2M 8way ECC&lt;br /&gt;
L3: up to 16M ECC. 16way&lt;br /&gt;
&lt;br /&gt;
	Itanium 2	Multi-core&lt;br /&gt;
Montecito&lt;br /&gt;
&lt;br /&gt;
Itanium2&lt;br /&gt;
L1:32KB&lt;br /&gt;
L2:256KB&lt;br /&gt;
L3:up to 9MB&lt;br /&gt;
PARISC	PA8800	PA8800&lt;br /&gt;
L1: 1.5M data 4way set, 1.5M instruction 4way set, 2cycle&lt;br /&gt;
L2: 32M &lt;br /&gt;
&lt;br /&gt;
Stream Processor	Storm-1 family	40 to 80 ALUs&lt;br /&gt;
&lt;br /&gt;
SP16HP-G220, SP16-G160, SP8-G80&lt;br /&gt;
L1: 32K data, 16K instruction&lt;br /&gt;
&lt;br /&gt;
96K VLIW instruction memory&lt;br /&gt;
&lt;br /&gt;
Sun Microsystems	UltraSPARC IV&lt;br /&gt;
UltraSPARC IV+	UltraSPARC IV&lt;br /&gt;
dual&lt;br /&gt;
L1: 64K data, 32K instruction&lt;br /&gt;
L2: up to 16MB, external 8M 2way set associative per core&lt;br /&gt;
Cache line size reduced from 512 to 128 bytes to reduce data contention associated with sub-blocked caches; LRU replacement policy, ECC &lt;br /&gt;
write cache: hash-indexed 2K,&lt;br /&gt;
2K prefetch cache&lt;br /&gt;
&lt;br /&gt;
UltraSPARC IV+&lt;br /&gt;
L2: 2M on-chip &lt;br /&gt;
L3: 32M external&lt;br /&gt;
&lt;br /&gt;
	UltraSPARC T1	8 cores&lt;br /&gt;
L1: 16K instruction cache per core, 4way-set associative parity protected&lt;br /&gt;
8K data cache , parity protected, 4way set associative&lt;br /&gt;
&lt;br /&gt;
L2: 3M 12way, 4banks, ECC&lt;br /&gt;
&lt;br /&gt;
== Cache size and characteristics of single-core processors ==&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
== Reference ==&lt;br /&gt;
[1] http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
[4] http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
[5] http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
[6] http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4398</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4398"/>
		<updated>2007-09-24T22:08:57Z</updated>

		<summary type="html">&lt;p&gt;Sykang: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cache size and characteristics of multi-core processors ==&lt;br /&gt;
&lt;br /&gt;
Wiki: Cache sizes in multicore architectures &lt;br /&gt;
&lt;br /&gt;
Create a table of caches used in current multicore architectures, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and coherence protocol used. Compare this with two or three recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
AMD	Opteron	Server/workstation&lt;br /&gt;
2 cores, 4/22/2005&lt;br /&gt;
&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64bits operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache, &lt;br /&gt;
L2: 1 MB per core, 16way associative &lt;br /&gt;
&lt;br /&gt;
	Athlon 64 X2 family	2 cores&lt;br /&gt;
5/13/2005&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64bits operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache, &lt;br /&gt;
L2: up to 1 MB per core, 16way associative &lt;br /&gt;
	Athlon 64 FX	High performance desktop&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64bits operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache, &lt;br /&gt;
L2: up to 1 MB per core, 16way associative&lt;br /&gt;
	Athlon64 ###	L1: 64K Data + 64K Instr.&lt;br /&gt;
2way associative data cache, two 64bits operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache, &lt;br /&gt;
L2 : 512K&lt;br /&gt;
16 way associative&lt;br /&gt;
	Turion 64 X2	Laptop&lt;br /&gt;
L1: Total 256KB (128K per core)&lt;br /&gt;
L2: Total 1MB (512K per core)&lt;br /&gt;
ARM	MPCore container for ARM9 &amp;amp; ARM11	High-performance embedded and entertainment&lt;br /&gt;
&lt;br /&gt;
Broadcom	SiByte	SB1250&lt;br /&gt;
Two scalable MIPS cores, MESI&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
SB1255&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
SB1455&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
&lt;br /&gt;
L2: 1MB shared, 8way associative, ECC protected&lt;br /&gt;
Cradle Technology	CT3400&lt;br /&gt;
CT3600	Multi-core DSP&lt;br /&gt;
CT3400&lt;br /&gt;
Eight 32-bit DSPs, six RISC-like CPUs&lt;br /&gt;
32K instruction cache, 64K local data memory&lt;br /&gt;
&lt;br /&gt;
CT3600&lt;br /&gt;
2 quad DSPs, 8 DSPs per quad, 32KB instruction cache per quad, 125K data memory per quad&lt;br /&gt;
&lt;br /&gt;
Cavium Networks	Octeon	16 MIPS cores&lt;br /&gt;
CN38XX/CN36XX 4 to 16 MIPS64 cores&lt;br /&gt;
1M ECC protected shared L2 (CN38XX), 512K (CN36XX)&lt;br /&gt;
32K instruction cache/8K data cache/2K write buffer per every MIPS core&lt;br /&gt;
IBM	Cell	In the PlayStation 3&lt;br /&gt;
PowerPC based&lt;br /&gt;
8 cores optimized for vector operation&lt;br /&gt;
	Power4	1st dual core&lt;br /&gt;
2000&lt;br /&gt;
&lt;br /&gt;
Power4&lt;br /&gt;
L1: 64K per CPU instruction cache (128 byte line, direct mapped, LRU)&lt;br /&gt;
     32K per CPU data cache (2way, 128byte line, LRU)&lt;br /&gt;
L2: 1440K, 8way, 128 byte line&lt;br /&gt;
L3: 128M, 8way 512 byte line&lt;br /&gt;
&lt;br /&gt;
Power4+&lt;br /&gt;
L1: data 32K(2way set), instruction 64K(directly mapped)&lt;br /&gt;
L2: 3 x 0.5M shared by dual core, 40B/cycle per port&lt;br /&gt;
L3: 32M&lt;br /&gt;
	Power5	Dual core&lt;br /&gt;
&lt;br /&gt;
L1: I-64K(2way LRU), D-32K(4way, LRU)&lt;br /&gt;
L2: three 0.625M, 10way, LRU, 128 byte line&lt;br /&gt;
L3: 36M, 12way&lt;br /&gt;
	PowerPC 970MP	Dual&lt;br /&gt;
In the Apple PowerMac&lt;br /&gt;
L1: 32K data(2 way associative with parity protection), 64K instruction (directly mapped)&lt;br /&gt;
&lt;br /&gt;
L2: 1M per core, ECC&lt;br /&gt;
Cache-coherency snooping protocol&lt;br /&gt;
Intel	Core2 Quad&lt;br /&gt;
Xeon	Quad core&lt;br /&gt;
12/13/2006&lt;br /&gt;
&lt;br /&gt;
Xeon 5000&lt;br /&gt;
L1: 16KB (Data cache per core) + 12K µOPS (Trace cache per core)&lt;br /&gt;
L2: 2MB per core&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Xeon MP7000&lt;br /&gt;
L1:16KB (Data cache per core)+ 12K µOPS (Trace cache per core)&lt;br /&gt;
L2: 1MB per core or 2x2MB per core&lt;br /&gt;
&lt;br /&gt;
Xeon MP7300&lt;br /&gt;
L1: 32K data per core, 32K instruction per core&lt;br /&gt;
L2: up to 8MB, snoop filter, 4M shared L2 per die&lt;br /&gt;
	Core Duo&lt;br /&gt;
Core2 Duo&lt;br /&gt;
Xeon (x1xx)	Dual core&lt;br /&gt;
&lt;br /&gt;
Xeon 5100&lt;br /&gt;
L1 : 32KB (Data) + 32KB (Instruction) per core&lt;br /&gt;
L2 : 4MB (Shared) &lt;br /&gt;
&lt;br /&gt;
Xeon 7100&lt;br /&gt;
L1: 16K data&lt;br /&gt;
L2: 2M 8way ECC&lt;br /&gt;
L3: up to 16M ECC. 16way&lt;br /&gt;
&lt;br /&gt;
	Itanium 2	Multi-core&lt;br /&gt;
Montecito&lt;br /&gt;
&lt;br /&gt;
Itanium2&lt;br /&gt;
L1:32KB&lt;br /&gt;
L2:256KB&lt;br /&gt;
L3:up to 9MB&lt;br /&gt;
PARISC	PA8800	PA8800&lt;br /&gt;
L1: 1.5M data 4way set, 1.5M instruction 4way set, 2cycle&lt;br /&gt;
L2: 32M &lt;br /&gt;
&lt;br /&gt;
Stream Processor	Storm-1 family	40 to 80 ALUs&lt;br /&gt;
&lt;br /&gt;
SP16HP-G220, SP16-G160, SP8-G80&lt;br /&gt;
L1: 32K data, 16K instruction&lt;br /&gt;
&lt;br /&gt;
96K VLIW instruction memory&lt;br /&gt;
&lt;br /&gt;
Sun Microsystems	UltraSPARC IV&lt;br /&gt;
UltraSPARC IV+	UltraSPARC IV&lt;br /&gt;
dual&lt;br /&gt;
L1: 64K data, 32K instruction&lt;br /&gt;
L2: up to 16MB, external 8M 2way set associative per core&lt;br /&gt;
Cache line size reduced from 512 to 128 bytes to reduce data contention associated with sub-blocked caches; LRU replacement policy, ECC &lt;br /&gt;
write cache: hash-indexed 2K,&lt;br /&gt;
2K prefetch cache&lt;br /&gt;
&lt;br /&gt;
UltraSPARC IV+&lt;br /&gt;
L2: 2M on-chip &lt;br /&gt;
L3: 32M external&lt;br /&gt;
&lt;br /&gt;
	UltraSPARC T1	8 cores&lt;br /&gt;
L1: 16K instruction cache per core, 4way-set associative parity protected&lt;br /&gt;
8K data cache , parity protected, 4way set associative&lt;br /&gt;
&lt;br /&gt;
L2: 3M 12way, 4banks, ECC&lt;br /&gt;
&lt;br /&gt;
== Cache size and characteristics of single-core processors ==&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
== Reference ==&lt;br /&gt;
http://www.amd.com/us-en/Processors/ProductInformation&lt;br /&gt;
http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250&lt;br /&gt;
http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor&lt;br /&gt;
http://www.intel.com/products/processor/xeon7000/documentation.htm?iid=products_xeon7000+tab_techdocs#datasheets&lt;br /&gt;
http://www.sun.com/processors/UltraSPARC-IV/&lt;br /&gt;
http://www.sun.com/processors/UltraSPARC-IV+/&lt;br /&gt;
http://www.sun.com/processors/UltraSPARC-T1/specs.xml&lt;br /&gt;
http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html&lt;br /&gt;
http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html&lt;br /&gt;
http://www.netlib.org/utk/papers/advanced-computers/power4.html&lt;br /&gt;
http://www.netlib.org/utk/papers/advanced-computers/power5.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4397</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4397"/>
		<updated>2007-09-24T22:03:32Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Cache size and characteristics of multi-core processors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cache size and characteristics of multi-core processors ==&lt;br /&gt;
&lt;br /&gt;
Wiki: Cache sizes in multicore architectures &lt;br /&gt;
&lt;br /&gt;
Create a table of caches used in current multicore architectures, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and coherence protocol used. Compare this with two or three recent single-core designs.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
AMD	Opteron	Server/workstation&lt;br /&gt;
2 cores, 4/22/2005&lt;br /&gt;
&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64bits operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache, &lt;br /&gt;
L2: 1 MB per core, 16way associative &lt;br /&gt;
&lt;br /&gt;
	Athlon 64 X2 family	2 cores&lt;br /&gt;
5/13/2005&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64bits operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache, &lt;br /&gt;
L2: up to 1 MB per core, 16way associative &lt;br /&gt;
	Athlon 64 FX	High performance desktop&lt;br /&gt;
L1: 64KB (Data) + 64KB (Instruction) per core&lt;br /&gt;
2way associative data cache, two 64bits operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache, &lt;br /&gt;
L2: up to 1 MB per core, 16way associative&lt;br /&gt;
	Athlon64 ###	L1: 64K Data + 64K Instr.&lt;br /&gt;
2way associative data cache, two 64bits operations per cycle, 3 cycle latency&lt;br /&gt;
2way associative instruction cache, &lt;br /&gt;
L2 : 512K&lt;br /&gt;
16 way associative&lt;br /&gt;
	Turion 64 X2	Laptop&lt;br /&gt;
L1: Total 256KB (128K per core)&lt;br /&gt;
L2: Total 1MB (512K per core)&lt;br /&gt;
ARM	MPCore container for ARM9 &amp;amp; ARM11	High-performance embedded and entertainment&lt;br /&gt;
&lt;br /&gt;
Broadcom	SiByte	SB1250&lt;br /&gt;
Two scalable MIPS cores, MESI&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
SB1255&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
Cache block 32 bytes&lt;br /&gt;
Cache line 32 bytes&lt;br /&gt;
L1 to L1 latency : 28~36 cycles&lt;br /&gt;
&lt;br /&gt;
L2: 512K shared, ECC, 4way associative&lt;br /&gt;
32 bytes cache line&lt;br /&gt;
&lt;br /&gt;
SB1455&lt;br /&gt;
L1: 32K data + 32K instruction&lt;br /&gt;
&lt;br /&gt;
L2: 1MB shared, 8way associative, ECC protected&lt;br /&gt;
Cradle Technology	CT3400&lt;br /&gt;
CT3600	Multi-core DSP&lt;br /&gt;
CT3400&lt;br /&gt;
Eight 32-bit DSPs, six RISC-like CPUs&lt;br /&gt;
32K instruction cache, 64K local data memory&lt;br /&gt;
&lt;br /&gt;
CT3600&lt;br /&gt;
2 quad DSPs, 8 DSPs per quad, 32KB instruction cache per quad, 125K data memory per quad&lt;br /&gt;
&lt;br /&gt;
Cavium Networks	Octeon	16 MIPS cores&lt;br /&gt;
CN38XX/CN36XX 4 to 16 MIPS64 cores&lt;br /&gt;
1M ECC protected shared L2 (CN38XX), 512K (CN36XX)&lt;br /&gt;
32K instruction cache/8K data cache/2K write buffer per every MIPS core&lt;br /&gt;
IBM	Cell	In the PlayStation 3&lt;br /&gt;
PowerPC based&lt;br /&gt;
8 cores optimized for vector operation&lt;br /&gt;
	Power4	1st dual core&lt;br /&gt;
2000&lt;br /&gt;
&lt;br /&gt;
Power4&lt;br /&gt;
L1: 64K per CPU instruction cache (128 byte line, direct mapped, LRU)&lt;br /&gt;
     32K per CPU data cache (2way, 128byte line, LRU)&lt;br /&gt;
L2: 1440K, 8way, 128 byte line&lt;br /&gt;
L3: 128M, 8way 512 byte line&lt;br /&gt;
&lt;br /&gt;
Power4+&lt;br /&gt;
L1: data 32K(2way set), instruction 64K(directly mapped)&lt;br /&gt;
L2: 3 x 0.5M shared by dual core, 40B/cycle per port&lt;br /&gt;
L3: 32M&lt;br /&gt;
	Power5	Dual core&lt;br /&gt;
&lt;br /&gt;
L1: I-64K(2way LRU), D-32K(4way, LRU)&lt;br /&gt;
L2: three 0.625M, 10way, LRU, 128 byte line&lt;br /&gt;
L3: 36M, 12way&lt;br /&gt;
	PowerPC 970MP	Dual&lt;br /&gt;
In the Apple PowerMac&lt;br /&gt;
L1: 32K data(2 way associative with parity protection), 64K instruction (directly mapped)&lt;br /&gt;
&lt;br /&gt;
L2: 1M per core, ECC&lt;br /&gt;
Cache-coherency snooping protocol&lt;br /&gt;
Intel	Core2 Quad&lt;br /&gt;
Xeon	Quad core&lt;br /&gt;
12/13/2006&lt;br /&gt;
&lt;br /&gt;
Xeon 5000&lt;br /&gt;
L1: 16KB (Data cache per core) + 12K µOPS (Trace cache per core)&lt;br /&gt;
L2: 2MB per core&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Xeon MP7000&lt;br /&gt;
L1:16KB (Data cache per core)+ 12K µOPS (Trace cache per core)&lt;br /&gt;
L2: 1MB per core or 2x2MB per core&lt;br /&gt;
&lt;br /&gt;
Xeon MP7300&lt;br /&gt;
L1: 32K data per core, 32K instruction per core&lt;br /&gt;
L2: up to 8MB, snoop filter, 4M shared L2 per die&lt;br /&gt;
	Core Duo&lt;br /&gt;
Core2 Duo&lt;br /&gt;
Xeon (x1xx)	Dual core&lt;br /&gt;
&lt;br /&gt;
Xeon 5100&lt;br /&gt;
L1 : 32KB (Data) + 32KB (Instruction) per core&lt;br /&gt;
L2 : 4MB (Shared) &lt;br /&gt;
&lt;br /&gt;
Xeon 7100&lt;br /&gt;
L1: 16K data&lt;br /&gt;
L2: 2M 8way ECC&lt;br /&gt;
L3: up to 16M ECC. 16way&lt;br /&gt;
&lt;br /&gt;
	Itanium 2	Multi-core&lt;br /&gt;
Montecito&lt;br /&gt;
&lt;br /&gt;
Itanium2&lt;br /&gt;
L1:32KB&lt;br /&gt;
L2:256KB&lt;br /&gt;
L3:up to 9MB&lt;br /&gt;
PARISC	PA8800	PA8800&lt;br /&gt;
L1: 1.5M data 4way set, 1.5M instruction 4way set, 2cycle&lt;br /&gt;
L2: 32M &lt;br /&gt;
&lt;br /&gt;
Stream Processor	Storm-1 family	40 to 80 ALUs&lt;br /&gt;
&lt;br /&gt;
SP16HP-G220, SP16-G160, SP8-G80&lt;br /&gt;
L1: 32K data, 16K instruction&lt;br /&gt;
&lt;br /&gt;
96K VLIW instruction memory&lt;br /&gt;
&lt;br /&gt;
Sun Microsystems	UltraSPARC IV&lt;br /&gt;
UltraSPARC IV+	UltraSPARC IV&lt;br /&gt;
dual&lt;br /&gt;
L1: 64K data, 32K instruction&lt;br /&gt;
L2: up to 16MB, external 8M 2way set associative per core&lt;br /&gt;
Cache line size reduced from 512 to 128 bytes to reduce data contention associated with sub-blocked caches; LRU replacement policy, ECC &lt;br /&gt;
write cache: hash-indexed 2K,&lt;br /&gt;
2K prefetch cache&lt;br /&gt;
&lt;br /&gt;
UltraSPARC IV+&lt;br /&gt;
L2: 2M on-chip &lt;br /&gt;
L3: 32M external&lt;br /&gt;
&lt;br /&gt;
	UltraSPARC T1	8 cores&lt;br /&gt;
L1: 16K instruction cache per core, 4way-set associative parity protected&lt;br /&gt;
8K data cache , parity protected, 4way set associative&lt;br /&gt;
&lt;br /&gt;
L2: 3M 12way, 4banks, ECC&lt;br /&gt;
&lt;br /&gt;
== Cache size and characteristics of single-core processors ==&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4396</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4396"/>
		<updated>2007-09-24T21:53:54Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Cache size and characteristics of multi-core processors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cache size and characteristics of multi-core processors ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Cache size and characteristics of single-core processors ==&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4395</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4395"/>
		<updated>2007-09-24T21:53:07Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Cache size and characteristics of multi-core processors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Cache size and characteristics of multi-core processors ==&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2-5-as&amp;diff=4394</id>
		<title>CSC/ECE 506 Fall 2007/wiki2-5-as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2-5-as&amp;diff=4394"/>
		<updated>2007-09-24T21:50:36Z</updated>

		<summary type="html">&lt;p&gt;Sykang: CSC/ECE 506 Fall 2007/wiki2-5-as moved to CSC/ECE 506 Fall 2007/wiki2 5 as: To match with the naming rule&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[CSC/ECE 506 Fall 2007/wiki2 5 as]]&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4393</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4393"/>
		<updated>2007-09-24T21:50:36Z</updated>

		<summary type="html">&lt;p&gt;Sykang: CSC/ECE 506 Fall 2007/wiki2-5-as moved to CSC/ECE 506 Fall 2007/wiki2 5 as: To match with the naming rule&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Cache size and characteristics of multi-core processors =&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4392</id>
		<title>CSC/ECE 506 Fall 2007/wiki2 5 as</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki2_5_as&amp;diff=4392"/>
		<updated>2007-09-24T21:48:25Z</updated>

		<summary type="html">&lt;p&gt;Sykang: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Cache size and characteristics of multi-core processors =&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=Talk:CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3475</id>
		<title>Talk:CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Talk:CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3475"/>
		<updated>2007-09-11T02:36:31Z</updated>

		<summary type="html">&lt;p&gt;Sykang: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= The reviewer pointed out as below.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;not very well; a separate section for superscalar could have been added rather than clubbing it with VLIW&amp;quot;&lt;br /&gt;
&lt;br /&gt;
That point seems persuasive; however, I disagree, because the superscalar processor has been the dominant microarchitecture for more than the past 20 years. The architectures explored in this assignment, such as VLIW, are fundamentally based on the superscalar processor, which remains the most important architecture; the architectures visited here are essentially additional beneficial features layered on top of superscalar.&lt;br /&gt;
&lt;br /&gt;
= The reviewer pointed out that the material seems not to be original, because the rephrased portion was small. The updated version is a restructured, rephrased, and expanded version of the previous one.&lt;br /&gt;
&lt;br /&gt;
= The reviewer pointed out as below.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;the author mentions that the current technology used in .13 microns. I think the current market trend is .045 microns with active research going on on .022 micron technology. That fact could have been mentioned.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
There was a misunderstanding of the length unit (micron). The new version reflects the corrected information.&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3467</id>
		<title>CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3467"/>
		<updated>2007-09-11T02:28:04Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Architectural Trends */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Architectural Trends ==&lt;br /&gt;
[[Image:MIPSR10000.jpg|thumb|right|300px|Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])]]&lt;br /&gt;
[[Image:IntelMoorsLaw.jpg|thumb|right|300px|Fig.2 The number of transistors on an Intel chip]]&lt;br /&gt;
Feature size refers to the minimum size of a transistor, or the width of the wires that connect transistors and other circuit components. Feature sizes decreased dramatically, from 10 microns in 1971 to 0.18 microns in 2001. These advances in integrated-circuit processes allowed the integration of one billion transistors on a single chip and enabled more complicated and faster microprocessor architectures, which have evolved in the direction of increasing parallelism: [http://en.wikipedia.org/wiki/Instruction_level_parallelism ILP] and [http://en.wikipedia.org/wiki/Thread_level_parallelism TLP]. With respect to microprocessor architecture, as superscalar processors came to prevail, several additional exploitable architectures were proposed during the past 10 years, as in previous decades. Building on the superscalar architecture, VLIW, superspeculative execution, simultaneous multithreading, chip multiprocessors, and others were proposed and explored. These techniques try to overcome the control and data hazards that grow with deep pipelining and multiple issue, as well as to maximize computing throughput through TLP.&lt;br /&gt;
&lt;br /&gt;
For example, the MIPS R10000 is a superscalar processor that executes instructions out of order; it has 6.8 million transistors on a 16.64 mm x 17.934 mm (298 mm&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;) die using a 0.35 um process. It fetches 4 instructions simultaneously and has a total of 6 pipelines: 5 pipelines for execution and 1 pipeline for fetching and decoding. The execution pipelines fall into 3 categories - integer, floating-point, and load/store.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VLIW ===&lt;br /&gt;
VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple-issue processors can be built in two basic ways - superscalar and VLIW. The key difference between superscalar and VLIW lies in how instructions are scheduled. While superscalar processors issue multiple instructions per clock, scheduled either statically or dynamically, VLIW processors issue instructions scheduled statically by the compiler. Both superscalar and VLIW processors have multiple, independent functional units.&lt;br /&gt;
&lt;br /&gt;
A VLIW compiler analyzes the program's instructions and groups multiple independent instructions into one large packaged instruction. A VLIW processor issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet that expresses the parallelism explicitly.&lt;br /&gt;
&lt;br /&gt;
To look into how VLIW operates, consider the example MIPS code below [1].&lt;br /&gt;
&lt;br /&gt;
for (i=1000; i&amp;gt;0; i=i-1)   x[i] = x[i] + s;&lt;br /&gt;
&lt;br /&gt;
The standard MIPS code looks like this:&lt;br /&gt;
&lt;br /&gt;
[[Image:simpleMIPS.jpg]]&lt;br /&gt;
&lt;br /&gt;
If loop-unrolling and scheduling the code are applied, then&lt;br /&gt;
&lt;br /&gt;
[[Image:loopunrollMIPS.jpg]]&lt;br /&gt;
&lt;br /&gt;
it takes 14 cycles for the loop body.&lt;br /&gt;
&lt;br /&gt;
If VLIW instructions are used, then&lt;br /&gt;
&lt;br /&gt;
[[Image:VLIW.jpg]]&lt;br /&gt;
&lt;br /&gt;
it takes 9 cycles, assuming 5 execution pipelines.&lt;br /&gt;
&lt;br /&gt;
The MIPS R10000 is also a good example. It has 2 integer functional units and 3 types of operands, so the compiler can generate one instruction that contains 3 integer operations with the corresponding operands for each operation. Other examples of VLIW designs include the Intel i860 and the Philips TriMedia.&lt;br /&gt;
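The loop-unrolling transformation behind these figures can be sketched outside of assembly as well. The following Python sketch (an illustration of the compiler transformation, not actual compiler output; the function names are invented) unrolls the loop body by a factor of four, exposing four independent additions per iteration that a VLIW compiler could pack into the slots of one long instruction:

```python
def saxpy_rolled(x, s):
    # Original loop: one add per iteration, one branch per element.
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s
    return x


def saxpy_unrolled(x, s):
    # Unrolled by 4: fewer branches, and four independent adds
    # per iteration for the scheduler to pack into wide issue slots.
    n = len(x)
    main = n - n % 4
    for i in range(0, main, 4):
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
    for i in range(main, n):
        # Remainder iterations when the trip count is not a multiple of 4.
        x[i] = x[i] + s
    return x
```

Both versions compute the same result; the unrolled form simply gives the static scheduler more independent work per loop iteration.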
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-threading ===&lt;br /&gt;
[[Image:SMTEx.jpg|thumb|right|300px|Fig.3 Four different approaches of using issue slots in superscalar processor (Redrawn from Fig 6.44 of [1])]]&lt;br /&gt;
Multi-threading enables exploiting thread-level parallelism (TLP) within a single processor. It allows multiple threads to share the functional units of a single processor in an overlapping manner. For this sharing, the processor has to maintain duplicated state for each thread - register file, PC, page table, and so on. In addition, the processor must be able to switch between threads quickly.&lt;br /&gt;
&lt;br /&gt;
There are two basic approaches to multi-threading: fine-grained and coarse-grained. The former switches between multiple interleaved threads on each instruction; to achieve this interleaving, the processor can switch threads on every clock cycle. The advantage of this architecture is that it hides stalls, because instructions from other threads can execute while one thread stalls. The disadvantage is that it slows down each individual thread's execution, because even when a thread's instruction is ready to execute, it may be delayed by another thread's instruction.&lt;br /&gt;
&lt;br /&gt;
The latter switches threads only when it meets a stall, at some cost. This policy reduces unnecessary thread switching, so an individual thread does not slow down as in the fine-grained case. However, each switch incurs the cost of refilling the pipeline. This kind of processor issues instructions from a single thread even as it switches the running thread; when a stall occurs, the pipeline empties, and to execute a new thread in place of the stalled one, the pipeline must be refilled, which is the source of the cost.&lt;br /&gt;
&lt;br /&gt;
Simultaneous multithreading (SMT) is a kind of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP while, at the same time, exploiting ILP across the issue slots of a single clock cycle. Figure 3 compares the three kinds of multi-threading with a conventional superscalar processor.&lt;br /&gt;
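As a toy illustration of the two switching policies described above (hypothetical instruction traces, not a cycle-accurate model; the marker 'S' and the function names are invented), the sketch below interleaves threads every cycle for the fine-grained case and switches only on a stall for the coarse-grained case:

```python
from collections import deque


def fine_grained(threads):
    # Rotate across ready threads every cycle (round-robin interleaving).
    queues = deque(deque(t) for t in threads)
    schedule = []
    while queues:
        q = queues.popleft()
        schedule.append(q.popleft())   # issue one instruction this cycle
        if q:                          # thread still has instructions
            queues.append(q)
    return schedule


def coarse_grained(threads):
    # Stay on one thread until it stalls ('S'), then switch threads.
    queues = deque(deque(t) for t in threads)
    schedule = []
    while queues:
        q = queues.popleft()
        while q and q[0] != 'S':
            schedule.append(q.popleft())
        if q:
            q.popleft()                # consume the stall marker
            schedule.append('switch')  # pipeline-refill cost on a switch
            queues.append(q)           # resume this thread later
    return schedule
```

Running the fine-grained scheduler on two threads alternates their instructions every cycle, while the coarse-grained scheduler runs each thread until its first stall.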
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-core ===&lt;br /&gt;
[[Image:Smithfield_die_med.jpg|thumb|right|80px|Fig.4 Intel® Pentium® processor Extreme Edition processor die [7]]]&lt;br /&gt;
Multi-core CPUs have multiple CPU cores on a single die, connected to each other through a shared L2 or L3 cache or through on-die glue logic such as a switch or bus. Every CPU core on a die shares the interconnect components used to interface with other processors and the rest of the system. These components include an FSB (Front Side Bus), a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The advantages of multi-core chips are power efficiency and simplicity around the processors: since multiple processors are packed into a single die, the glue logic required to connect the processors is packed into the die as well. This saves power and simplifies the auxiliary circuits compared with discrete processors coupled on a PCB.&lt;br /&gt;
The Intel Pentium Extreme Edition, Core Duo, and Core 2 Duo are good examples of multi-core processors.&lt;br /&gt;
The Intel Xeon X7300 series packs four cores on a single die using a 65 nm process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Speculative Execution ===&lt;br /&gt;
In the quest for more ILP, managing control dependences becomes more important but also more burdensome. To reduce the cost of branch stalls, branch prediction is applied at the instruction-fetch stage. However, for a processor that executes multiple instructions per clock, more than accurate prediction is required: to speculate is to act on these predictions, fetching and executing instructions from the predicted path. [12]&lt;br /&gt;
&lt;br /&gt;
[[Image:speculative.jpg]]&lt;br /&gt;
&lt;br /&gt;
Under speculative execution, instructions are fetched, issued, and executed as if branch predictions were always correct; when a misprediction occurs, a recovery mechanism handles the situation. When the processor meets a branch, it predicts the branch target, takes a checkpoint, and follows the predicted path. While checkpointing, the processor saves a copy of information such as the register file, control information, and the alternative branch target. If the prediction is correct, the processor reclaims the stored information for use by newly predicted branches; if it is incorrect, it restores the execution state from the corresponding checkpoint. Examples include the PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, and AMD K5/K6/Athlon.&lt;br /&gt;
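The checkpoint-and-recover behaviour just described can be sketched as a software analogy of the hardware mechanism (class and method names here are invented for illustration):

```python
class SpeculativeCore:
    """Software analogy of checkpoint-based branch recovery."""

    def __init__(self):
        self.registers = {}    # architectural register file
        self.checkpoints = []  # stack of saved states, one per in-flight branch

    def predict_branch(self):
        # On a branch, save a copy of architectural state
        # before executing down the predicted path.
        self.checkpoints.append(dict(self.registers))

    def resolve_branch(self, prediction_correct):
        checkpoint = self.checkpoints.pop()
        if prediction_correct:
            return             # reclaim the checkpoint; speculative results stand
        # Misprediction: roll back to the state saved at the branch.
        self.registers = checkpoint
```

On a correct prediction the speculative updates are kept and the checkpoint is simply discarded; on a misprediction the register state reverts to the snapshot taken at the branch.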
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Updated Figure 1.8 &amp;amp; 1.9 ==&lt;br /&gt;
[[Image:fig18.jpg|frame|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.8 of our book has been updated to incorporate trends from 2000 to the present. The SGI Origin 3000 series was reintroduced as the Origin 3400 and Origin 3900 in 2000 and 2003, respectively. Sun introduced enterprise servers even more powerful than the E10000: the E15000 in 2002, and the E20000 and E25000 in 2006. HP's high-end 9000 Superdome supercomputer, with 16, 32, and 64 processors, was released this year (2007).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:fig19.jpg|frame|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors(Y-axis is log-scaled)]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.9 shows the shared-memory bus bandwidth of the servers introduced in Figure 1.8: the SGI Origin 3000 series; the Sun Enterprise 15K, 20K, and 25K; the IBM p5 590; and the HP 9000 Superdome. In the case of the Sun E25K, the available bandwidth is 43.2 GBps and the aggregate bandwidth exceeds 100 GBps. The Origin 3900 has 12.8 GBps of bandwidth and an aggregate bandwidth of 172.8 GBps.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] John L. Hennessy, David A. Patterson, &amp;quot;Computer Architecture: A Quantitative Approach&amp;quot; 3rd Ed., Morgan Kaufmann, CA, USA&lt;br /&gt;
&lt;br /&gt;
[2] CE Kozyrakis, DA Patterson, &amp;quot;A new direction for computer architecture research&amp;quot;, &lt;br /&gt;
Computer Volume 31 Issue 11, IEEE, Nov 1998, pp24-32&lt;br /&gt;
&lt;br /&gt;
[3] K.C. Yeager, &amp;quot;The MIPS R10000 Superscalar Microprocessor&amp;quot;, IEEE Micro Volume 16 Issue 2, Apr. 1996, pp28-41&lt;br /&gt;
&lt;br /&gt;
[4] Geoff Koch, &amp;quot;Discovering Multi-Core: Extending the Benefits of Moore’s Law&amp;quot;, Technology@Intel Magazine, Jul 2005, pp1-6&lt;br /&gt;
&lt;br /&gt;
[5] Richard Low, &amp;quot;Microprocessor trends: multicore, memory, and power developments&amp;quot;, Embedded Computing Design, Sep 2005&lt;br /&gt;
&lt;br /&gt;
[6] Artur Klauser, &amp;quot;Trends in High-Performance Microprocessor Design&amp;quot;, Telematik 1, 2001&lt;br /&gt;
&lt;br /&gt;
[7] http://www.intel.com &amp;amp; http://www.intel.com/pressroom/kits/pentiumee&lt;br /&gt;
&lt;br /&gt;
[8] http://www.alimartech.com/9000_servers.htm&lt;br /&gt;
&lt;br /&gt;
[9] http://www.sun.com/servers/index.jsp?gr0=cpu&amp;amp;fl0=cpu4&amp;amp;gr1=&lt;br /&gt;
&lt;br /&gt;
[10] http://www.sgi.com/pdfs/3867.pdf&lt;br /&gt;
&lt;br /&gt;
[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html&lt;br /&gt;
&lt;br /&gt;
[12] Eric Rotenberg, ECE721 Advanced Microarchitecture lecture notes, NCSU, 2007&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3445</id>
		<title>CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3445"/>
		<updated>2007-09-11T02:10:02Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Speculative Execution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Architectural Trends ==&lt;br /&gt;
[[Image:MIPSR10000.jpg|thumb|right|300px|Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])]]&lt;br /&gt;
[[Image:IntelMoorsLaw.jpg|thumb|right|300px|Fig.2 The number of transistors on Intel chips]]&lt;br /&gt;
Feature size refers to the minimum size of a transistor, or the width of the wires that connect transistors and other circuit components. Feature sizes decreased dramatically from 10 microns in 1971 to 0.18 microns in 2001. These advances in integrated-circuit processes allowed the integration of one billion transistors on a single chip and enabled more complex and faster microprocessor architectures, which have evolved in the direction of increasing parallelism: [http://en.wikipedia.org/wiki/Instruction_level_parallelism ILP] and [http://en.wikipedia.org/wiki/Thread_level_parallelism TLP]. With respect to microprocessor architecture, as superscalar processors became prevalent, several additional exploitable architectures were proposed during the past 10 years, just as in earlier decades. Building on the superscalar architecture, VLIW, superspeculative, simultaneous multithreading, and chip multiprocessor designs, among others, were proposed and explored. These techniques try to overcome the control and data hazards that grow with deep pipelining and multiple issue, as well as to maximize computing throughput through TLP.&lt;br /&gt;
&lt;br /&gt;
For example, the MIPS R10000 is a superscalar processor that executes instructions out of order; it has 6.8 million transistors on a 16.64 mm x 17.934 mm (298 mm&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;) die using a 0.35 um process. It fetches 4 instructions simultaneously and has a total of 6 pipelines: 5 pipelines for execution and 1 pipeline for fetching and decoding. The execution pipelines fall into 3 categories - integer, floating-point, and load/store.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VLIW ===&lt;br /&gt;
VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple-issue processors can be built in two basic ways - superscalar and VLIW. The key difference between superscalar and VLIW lies in how instructions are scheduled. While superscalar processors issue multiple instructions per clock, scheduled either statically or dynamically, VLIW processors issue instructions scheduled statically by the compiler. Both superscalar and VLIW processors have multiple, independent functional units.&lt;br /&gt;
&lt;br /&gt;
A VLIW compiler analyzes the program's instructions and groups multiple independent instructions into one large packaged instruction. A VLIW processor issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet that expresses the parallelism explicitly.&lt;br /&gt;
&lt;br /&gt;
To look into how VLIW operates, consider the example MIPS code below [1].&lt;br /&gt;
&lt;br /&gt;
for (i=1000; i&amp;gt;0; i=i-1)   x[i] = x[i] + s;&lt;br /&gt;
&lt;br /&gt;
The standard MIPS code looks like this:&lt;br /&gt;
&lt;br /&gt;
[[Image:simpleMIPS.jpg]]&lt;br /&gt;
&lt;br /&gt;
If loop-unrolling and scheduling the code are applied, then&lt;br /&gt;
&lt;br /&gt;
[[Image:loopunrollMIPS.jpg]]&lt;br /&gt;
&lt;br /&gt;
It takes 14 cycles for the loop body.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If VLIW instructions are used, then&lt;br /&gt;
[[Image:VLIW.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The MIPS R10000 is also a good example. It has 2 integer functional units and 3 kinds of operands, so the compiler can generate one instruction that contains 3 integer operations with the corresponding operands for each operation. Other examples of VLIW designs include the Intel i860 and the Philips TriMedia.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Multi-threading ===&lt;br /&gt;
[[Image:SMTEx.jpg|thumb|right|300px|Fig.3 Four different approaches of using issue slots in superscalar processor (Redrawn from Fig 6.44 of [1])]]&lt;br /&gt;
Multi-threading enables exploiting thread-level parallelism (TLP) within a single processor. It allows multiple threads to share the functional units of a single processor in an overlapping manner. For this sharing, the processor has to maintain duplicated state for each thread - register file, PC, page table, and so on. In addition, the processor must be able to switch between threads quickly.&lt;br /&gt;
&lt;br /&gt;
There are two basic approaches to multi-threading: fine-grained and coarse-grained. The former switches between multiple interleaved threads on each instruction; to achieve this interleaving, the processor can switch threads on every clock cycle. The advantage of this architecture is that it hides stalls, because instructions from other threads can execute while one thread stalls. The disadvantage is that it slows down each individual thread's execution, because even when a thread's instruction is ready to execute, it may be delayed by another thread's instruction.&lt;br /&gt;
&lt;br /&gt;
The latter switches threads only when it meets a stall, at some cost. This policy reduces unnecessary thread switching, so an individual thread does not slow down as in the fine-grained case. However, each switch incurs the cost of refilling the pipeline. This kind of processor issues instructions from a single thread even as it switches the running thread; when a stall occurs, the pipeline empties, and to execute a new thread in place of the stalled one, the pipeline must be refilled, which is the source of the cost.&lt;br /&gt;
&lt;br /&gt;
Simultaneous multithreading (SMT) is a kind of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP while, at the same time, exploiting ILP across the issue slots of a single clock cycle. Fig. 3 compares the three kinds of multi-threading with a conventional superscalar processor.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-core ===&lt;br /&gt;
[[Image:Smithfield_die_med.jpg|thumb|right|80px|Fig.4 Intel® Pentium® processor Extreme Edition processor die [7]]]&lt;br /&gt;
Multi-core CPUs have multiple CPU cores on a single die, connected to each other through a shared L2 or L3 cache or through on-die glue logic such as a switch or bus. Every CPU core on a die shares the interconnect components used to interface with other processors and the rest of the system. These components include an FSB (Front Side Bus), a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The advantages of multi-core chips are power efficiency and simplicity around the processors: since multiple processors are packed into a single die, the glue logic required to connect the processors is packed into the die as well. This saves power and simplifies the auxiliary circuits compared with discrete processors coupled on a PCB.&lt;br /&gt;
The Intel Pentium Extreme Edition, Core Duo, and Core 2 Duo are good examples of multi-core processors.&lt;br /&gt;
The Intel Xeon X7300 series packs four cores on a single die using a 65 nm process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Speculative Execution ===&lt;br /&gt;
In the quest for more ILP, managing control dependences becomes more important but also more burdensome. To reduce the cost of branch stalls, branch prediction is applied at the instruction-fetch stage. However, for a processor that executes multiple instructions per clock, more than accurate prediction is required: to speculate is to act on these predictions, fetching and executing instructions from the predicted path. [12]&lt;br /&gt;
&lt;br /&gt;
[[Image:speculative.jpg]]&lt;br /&gt;
&lt;br /&gt;
Under speculative execution, instructions are fetched, issued, and executed as if branch predictions were always correct; when a misprediction occurs, a recovery mechanism handles the situation. When the processor meets a branch, it predicts the branch target, takes a checkpoint, and follows the predicted path. While checkpointing, the processor saves a copy of information such as the register file, control information, and the alternative branch target. If the prediction is correct, the processor reclaims the stored information for use by newly predicted branches; if it is incorrect, it restores the execution state from the corresponding checkpoint. Examples include the PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, and AMD K5/K6/Athlon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Updated Figure 1.8 &amp;amp; 1.9 ==&lt;br /&gt;
[[Image:fig18.jpg|frame|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.8 of our book has been updated to incorporate trends from 2000 to the present. The SGI Origin 3000 series was reintroduced as the Origin 3400 and Origin 3900 in 2000 and 2003, respectively. Sun introduced enterprise servers even more powerful than the E10000: the E15000 in 2002, and the E20000 and E25000 in 2006. HP's high-end 9000 Superdome supercomputer, with 16, 32, and 64 processors, was released this year (2007).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:fig19.jpg|frame|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors(Y-axis is log-scaled)]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.9 shows the shared-memory bus bandwidth of the servers introduced in Figure 1.8: the SGI Origin 3000 series; the Sun Enterprise 15K, 20K, and 25K; the IBM p5 590; and the HP 9000 Superdome. In the case of the Sun E25K, the available bandwidth is 43.2 GBps and the aggregate bandwidth exceeds 100 GBps. The Origin 3900 has 12.8 GBps of bandwidth and an aggregate bandwidth of 172.8 GBps.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] John L. Hennessy, David A. Patterson, &amp;quot;Computer Architecture: A Quantitative Approach&amp;quot; 3rd Ed., Morgan Kaufmann, CA, USA&lt;br /&gt;
&lt;br /&gt;
[2] CE Kozyrakis, DA Patterson, &amp;quot;A new direction for computer architecture research&amp;quot;, &lt;br /&gt;
Computer Volume 31 Issue 11, IEEE, Nov 1998, pp24-32&lt;br /&gt;
&lt;br /&gt;
[3] K.C. Yeager, &amp;quot;The MIPS R10000 Superscalar Microprocessor&amp;quot;, IEEE Micro Volume 16 Issue 2, Apr. 1996, pp28-41&lt;br /&gt;
&lt;br /&gt;
[4] Geoff Koch, &amp;quot;Discovering Multi-Core: Extending the Benefits of Moore’s Law&amp;quot;, Technology@Intel Magazine, Jul 2005, pp1-6&lt;br /&gt;
&lt;br /&gt;
[5] Richard Low, &amp;quot;Microprocessor trends: multicore, memory, and power developments&amp;quot;, Embedded Computing Design, Sep 2005&lt;br /&gt;
&lt;br /&gt;
[6] Artur Klauser, &amp;quot;Trends in High-Performance Microprocessor Design&amp;quot;, Telematik 1, 2001&lt;br /&gt;
&lt;br /&gt;
[7] http://www.intel.com &amp;amp; http://www.intel.com/pressroom/kits/pentiumee&lt;br /&gt;
&lt;br /&gt;
[8] http://www.alimartech.com/9000_servers.htm&lt;br /&gt;
&lt;br /&gt;
[9] http://www.sun.com/servers/index.jsp?gr0=cpu&amp;amp;fl0=cpu4&amp;amp;gr1=&lt;br /&gt;
&lt;br /&gt;
[10] http://www.sgi.com/pdfs/3867.pdf&lt;br /&gt;
&lt;br /&gt;
[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html&lt;br /&gt;
&lt;br /&gt;
[12] Eric Rotenberg, ECE721 Advanced Microarchitecture lecture notes, NCSU, 2007&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Speculative.jpg&amp;diff=3444</id>
		<title>File:Speculative.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Speculative.jpg&amp;diff=3444"/>
		<updated>2007-09-11T02:09:47Z</updated>

		<summary type="html">&lt;p&gt;Sykang: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3443</id>
		<title>CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3443"/>
		<updated>2007-09-11T02:09:39Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Speculative Execution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Architectural Trends ==&lt;br /&gt;
[[Image:MIPSR10000.jpg|thumb|right|300px|Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])]]&lt;br /&gt;
[[Image:IntelMoorsLaw.jpg|thumb|right|300px|Fig.2 The number of transistors on Intel chips]]&lt;br /&gt;
Feature size refers to the minimum size of a transistor, or the width of the wires that connect transistors and other circuit components. Feature sizes decreased dramatically from 10 microns in 1971 to 0.18 microns in 2001. These advances in integrated-circuit processes allowed the integration of one billion transistors on a single chip and enabled more complex and faster microprocessor architectures, which have evolved in the direction of increasing parallelism: [http://en.wikipedia.org/wiki/Instruction_level_parallelism ILP] and [http://en.wikipedia.org/wiki/Thread_level_parallelism TLP]. With respect to microprocessor architecture, as superscalar processors became prevalent, several additional exploitable architectures were proposed during the past 10 years, just as in earlier decades. Building on the superscalar architecture, VLIW, superspeculative, simultaneous multithreading, and chip multiprocessor designs, among others, were proposed and explored. These techniques try to overcome the control and data hazards that grow with deep pipelining and multiple issue, as well as to maximize computing throughput through TLP.&lt;br /&gt;
&lt;br /&gt;
For example, the MIPS R10000 is a superscalar processor that executes instructions out of order; it has 6.8 million transistors on a 16.64 mm x 17.934 mm (298 mm&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;) die using a 0.35 um process. It fetches 4 instructions simultaneously and has a total of 6 pipelines: 5 pipelines for execution and 1 pipeline for fetching and decoding. The execution pipelines fall into 3 categories - integer, floating-point, and load/store.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VLIW ===&lt;br /&gt;
VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple-issue processors can be built in two basic ways - superscalar and VLIW. The key difference between superscalar and VLIW lies in how instructions are scheduled. While superscalar processors issue multiple instructions per clock, scheduled either statically or dynamically, VLIW processors issue instructions scheduled statically by the compiler. Both superscalar and VLIW processors have multiple, independent functional units.&lt;br /&gt;
&lt;br /&gt;
A VLIW compiler analyzes the program's instructions and groups multiple independent instructions into one large packaged instruction. A VLIW processor issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet that expresses the parallelism explicitly.&lt;br /&gt;
&lt;br /&gt;
To look into how VLIW operates, consider the example MIPS code below [1].&lt;br /&gt;
&lt;br /&gt;
for (i=1000; i&amp;gt;0; i=i-1)   x[i] = x[i] + s;&lt;br /&gt;
&lt;br /&gt;
The standard MIPS code looks like this:&lt;br /&gt;
&lt;br /&gt;
[[Image:simpleMIPS.jpg]]&lt;br /&gt;
&lt;br /&gt;
If loop-unrolling and scheduling the code are applied, then&lt;br /&gt;
&lt;br /&gt;
[[Image:loopunrollMIPS.jpg]]&lt;br /&gt;
&lt;br /&gt;
It takes 14 cycles for the loop body.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If VLIW instructions are used, then&lt;br /&gt;
[[Image:VLIW.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The MIPS R10000 is also a good example. It has 2 integer functional units and 3 kinds of operands, so the compiler can generate one instruction that contains 3 integer operations with the corresponding operands for each operation. Other examples of VLIW designs include the Intel i860 and the Philips TriMedia.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Multi-threading ===&lt;br /&gt;
[[Image:SMTEx.jpg|thumb|right|300px|Fig.3 Four different approaches of using issue slots in superscalar processor (Redrawn from Fig 6.44 of [1])]]&lt;br /&gt;
Multi-threading enables exploiting thread-level parallelism (TLP) within a single processor. It allows multiple threads to share the functional units of a single processor in an overlapping manner. For this sharing, the processor has to maintain duplicated state for each thread - register file, PC, page table, and so on. In addition, the processor must be able to switch between threads quickly.&lt;br /&gt;
&lt;br /&gt;
There are two basic approaches to multi-threading: fine-grained and coarse-grained. The former switches between multiple interleaved threads on each instruction; to achieve this interleaving, the processor can switch threads on every clock cycle. The advantage of this architecture is that it hides stalls, because instructions from other threads can execute while one thread stalls. The disadvantage is that it slows down each individual thread's execution, because even when a thread's instruction is ready to execute, it may be delayed by another thread's instruction.&lt;br /&gt;
&lt;br /&gt;
The latter switches threads only when it meets a stall, at some cost. This policy reduces unnecessary thread switching, so an individual thread does not slow down as in the fine-grained case. However, each switch incurs the cost of refilling the pipeline. This kind of processor issues instructions from a single thread even as it switches the running thread; when a stall occurs, the pipeline empties, and to execute a new thread in place of the stalled one, the pipeline must be refilled, which is the source of the cost.&lt;br /&gt;
&lt;br /&gt;
Simultaneous multithreading (SMT) is a kind of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP while, at the same time, exploiting ILP across the issue slots of a single clock cycle. Fig. 3 compares the three kinds of multi-threading with a conventional superscalar processor.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-core ===&lt;br /&gt;
[[Image:Smithfield_die_med.jpg|thumb|right|80px|Fig.4 Intel® Pentium® processor Extreme Edition processor die [7]]]&lt;br /&gt;
Multi-core CPUs have multiple CPU cores on a single die, connected to each other through a shared L2 or L3 cache or through on-die glue logic such as a switch or bus. Every CPU core on a die shares the interconnect components used to interface with other processors and the rest of the system. These components include an FSB (Front Side Bus), a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The advantages of multi-core chips are power efficiency and simplicity around the processors: since multiple processors are packed into a single die, the glue logic required to connect the processors is packed into the die as well. This saves power and simplifies the auxiliary circuits compared with discrete processors coupled on a PCB.&lt;br /&gt;
The Intel Pentium Extreme Edition, Core Duo, and Core 2 Duo are good examples of multi-core processors.&lt;br /&gt;
The Intel Xeon X7300 series packs four cores on a single die using a 65 nm process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Speculative Execution ===&lt;br /&gt;
In the quest for more ILP, managing control dependences becomes more important but also more burdensome. To reduce the cost of branch stalls, branch prediction is applied at the instruction-fetch stage. However, for a processor that executes multiple instructions per clock, more than accurate prediction is required: to speculate is to act on these predictions, fetching and executing instructions from the predicted path. [12]&lt;br /&gt;
&lt;br /&gt;
[[Image:speculative.jpg]]&lt;br /&gt;
Under speculative execution, instructions are fetched, issued, and executed as if branch predictions were always correct; when a misprediction occurs, a recovery mechanism handles the situation. When the processor meets a branch, it predicts the branch target, takes a checkpoint, and follows the predicted path. While checkpointing, the processor saves a copy of information such as the register file, control information, and the alternative branch target. If the prediction is correct, the processor reclaims the stored information for use by newly predicted branches; if it is incorrect, it restores the execution state from the corresponding checkpoint. Examples include the PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, and AMD K5/K6/Athlon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Updated Figure 1.8 &amp;amp; 1.9 ==&lt;br /&gt;
[[Image:fig18.jpg|frame|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.8 of our book has been updated to incorporate trends from 2000 to the present. The SGI Origin 3000 series was refreshed as the Origin 3400 and the Origin 3900 in 2000 and 2003, respectively. Sun introduced enterprise servers even more powerful than the E10000: the E15000 in 2002, and the E20000 and E25000 in 2006. HP's high-end 9000 Superdome servers with 16, 32, and 64 processors were released this year (2007).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:fig19.jpg|frame|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors(Y-axis is log-scaled)]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.9 shows the shared-memory bus bandwidth of the servers introduced in Figure 1.8: the SGI Origin 3000 series and the Sun Enterprise 15K, 20K, and 25K, as well as the IBM p5 590 and the HP 9000 Superdome. In the case of the Sun E25K, the available bandwidth is 43.2 GBps and the aggregate bandwidth exceeds 100 GBps. The Origin 3900 has 12.8 GBps of bus bandwidth and an aggregate bandwidth of 172.8 GBps.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] John L. Hennessy, David A. Patterson, &amp;quot;Computer Architecture: A Quantitative Approach&amp;quot; 3rd Ed., Morgan Kaufmann, CA, USA&lt;br /&gt;
&lt;br /&gt;
[2] CE Kozyrakis, DA Patterson, &amp;quot;A new direction for computer architecture research&amp;quot;, &lt;br /&gt;
Computer Volume 31 Issue 11, IEEE, Nov 1998, pp24-32&lt;br /&gt;
&lt;br /&gt;
[3] K.C. Yeager, &amp;quot;The MIPS R10000 Superscalar Microprocessor&amp;quot;, IEEE Micro Volume 16 Issue 2, Apr. 1996, pp28-41&lt;br /&gt;
&lt;br /&gt;
[4] Geoff Koch, &amp;quot;Discovering Multi-Core: Extending the Benefits of Moore’s Law&amp;quot;, Technology@Intel Magazine, Jul 2005, pp1-6&lt;br /&gt;
&lt;br /&gt;
[5] Richard Low, &amp;quot;Microprocessor trends:multicore, memory, and power developments&amp;quot;, Embedded Computing Design, Sep 2005&lt;br /&gt;
&lt;br /&gt;
[6] Artur Klauser, &amp;quot;Trends in High-Performance Microprocessor Design&amp;quot;, Telematik 1, 2001&lt;br /&gt;
&lt;br /&gt;
[7] http://www.intel.com &amp;amp; http://www.intel.com/pressroom/kits/pentiumee&lt;br /&gt;
&lt;br /&gt;
[8] http://www.alimartech.com/9000_servers.htm&lt;br /&gt;
&lt;br /&gt;
[9] http://www.sun.com/servers/index.jsp?gr0=cpu&amp;amp;fl0=cpu4&amp;amp;gr1=&lt;br /&gt;
&lt;br /&gt;
[10] http://www.sgi.com/pdfs/3867.pdf&lt;br /&gt;
&lt;br /&gt;
[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html&lt;br /&gt;
&lt;br /&gt;
[12] Eric Rotenberg, ECE721 Advanced Microarchitecture lecture notes, NCSU, 2007&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3440</id>
		<title>CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3440"/>
		<updated>2007-09-11T02:07:01Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Architectural Trends ==&lt;br /&gt;
[[Image:MIPSR10000.jpg|thumb|right|300px|Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])]]&lt;br /&gt;
[[Image:IntelMoorsLaw.jpg|thumb|right|300px|Fig.2 The number of transistors on Intel chips]]&lt;br /&gt;
Feature size means the minimum size of a transistor, or the width of the wires that connect transistors and other circuit components. Feature sizes have decreased dramatically, from 10 microns in 1971 to 0.18 microns in 2001. These advances in integrated-circuit processes allowed the integration of one billion transistors on a single chip and enabled more complex and faster microprocessor architectures, which have evolved in the direction of increasing parallelism: [http://en.wikipedia.org/wiki/Instruction_level_parallelism ILP] and [http://en.wikipedia.org/wiki/Thread_level_parallelism TLP]. With respect to microprocessor architecture, as superscalar processors came to prevail, several additional exploitable architectures were proposed during the past 10 years, as in past decades. Building on the superscalar approach, VLIW, superspeculative, simultaneous multithreading, chip multiprocessor, and other designs were proposed and explored. These techniques attempt to overcome the control and data hazards that deep pipelining and multiple issue aggravate, as well as to maximize computing throughput through TLP.&lt;br /&gt;
&lt;br /&gt;
For example, the MIPS R10000 is a superscalar processor that executes out of order; it packs 6.8 million transistors into a 16.64 mm x 17.934 mm (298 mm^2) die using a 0.35 um process. It fetches 4 instructions simultaneously and has 6 pipelines in total: 5 for execution and 1 for fetching and decoding. The execution pipelines fall into 3 categories: integer, floating-point, and load/store.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VLIW ===&lt;br /&gt;
VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple issue can be attained in two basic ways: superscalar and VLIW. The big difference between superscalar and VLIW lies in how instructions are scheduled. While superscalar processors issue multiple instructions per clock that are scheduled either statically or dynamically, VLIWs issue instructions scheduled statically by the compiler. Both superscalar and VLIW processors have multiple independent functional units.&lt;br /&gt;
&lt;br /&gt;
A VLIW compiler analyzes the program's instructions and groups multiple independent instructions into one large packaged instruction. A VLIW processor issues a fixed number of instructions, formatted either as one long instruction or as a fixed instruction packet with the parallelism made explicit.&lt;br /&gt;
&lt;br /&gt;
To look inside VLIW operation, consider the following example code for MIPS [1].&lt;br /&gt;
&lt;br /&gt;
for (i=1000; i&amp;gt;0; i=i-1)   x[i] = x[i] + s;&lt;br /&gt;
&lt;br /&gt;
The standard MIPS code looks like this:&lt;br /&gt;
&lt;br /&gt;
[[Image:simpleMIPS.jpg]]&lt;br /&gt;
&lt;br /&gt;
If loop-unrolling and scheduling the code are applied, then&lt;br /&gt;
&lt;br /&gt;
[[Image:loopunrollMIPS.jpg]]&lt;br /&gt;
&lt;br /&gt;
It takes 14 cycles for the unrolled loop body.&lt;br /&gt;
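In code form, the unrolling transformation above can be sketched as follows (the 4-way unroll factor and the function names are illustrative, chosen to mirror the textbook example; a real compiler performs this on the MIPS assembly, not at the source level):&lt;br /&gt;

```python
# Sketch of loop unrolling for: for (i=1000; i>0; i=i-1) x[i] = x[i] + s;
# (0-based indexing here; the unroll factor of 4 is an assumption.)

def add_s_rolled(x, s):
    """Original loop: one add per iteration, so the loop overhead
    (decrement, compare, branch) is paid once per element."""
    for i in range(len(x)):
        x[i] = x[i] + s
    return x

def add_s_unrolled(x, s):
    """Unrolled loop: four independent adds per iteration. The adds have
    no dependences on each other, so a compiler can schedule them into
    separate issue slots -- or pack them into one VLIW word."""
    n = len(x)
    i = 0
    while i + 4 <= n:
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4
    while i < n:          # cleanup loop when n is not a multiple of 4
        x[i] = x[i] + s
        i += 1
    return x
```

Both versions compute the same result; the payoff of unrolling is fewer branches and more independent operations available for scheduling each cycle.&lt;br /&gt;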
&lt;br /&gt;
&lt;br /&gt;
If VLIW instructions are used, then&lt;br /&gt;
[[Image:VLIW.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The MIPS R10000 is also an example: it has 2 integer functional units and 3 kinds of operands, so the compiler can generate one instruction containing 3 integer operations along with the operands for each operation. Other examples of VLIW designs include the Intel i860 and the Philips TriMedia.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Multi-threading ===&lt;br /&gt;
[[Image:SMTEx.jpg|thumb|right|300px|Fig.3 Four different approaches of using issue slots in superscalar processor (Redrawn from Fig 6.44 of [1])]]&lt;br /&gt;
Multi-threading exploits thread-level parallelism (TLP) within a single processor. It allows multiple threads to share the functional units of a single processor in an overlapping manner. For this sharing, the processor must maintain duplicated state for each thread: its register file, PC, page table, and so on. In addition, the processor must be able to switch between threads quickly. &lt;br /&gt;
&lt;br /&gt;
There are two basic approaches to multi-threading: fine-grained and coarse-grained. The former switches between threads on each instruction, so the execution of multiple threads is interleaved; to support this, the processor can switch threads on every clock cycle. The advantage of this architecture is that it hides stalls, because instructions from other threads can execute when one thread stalls. The disadvantage is that it slows down the execution of an individual thread, because even when a thread's instruction is ready to execute, it may be delayed by instructions from other threads.&lt;br /&gt;
&lt;br /&gt;
The latter switches threads only on costly stalls. This policy avoids unnecessary thread switches, so an individual thread is not slowed down, in contrast to the fine-grained case. However, each switch incurs the cost of refilling the pipeline: this kind of processor issues instructions from a single thread, so when a stall occurs the pipeline empties, and before a new thread can execute in place of the stalled one, the pipeline must be refilled.&lt;br /&gt;
&lt;br /&gt;
Simultaneous multithreading (SMT) is a kind of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP, while at the same time exploiting ILP across the issue slots within a single clock cycle. Fig. 3 compares the three kinds of multi-threading with a plain superscalar processor.&lt;br /&gt;
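As a toy illustration of the two switching policies (the two-thread setup, the 'STALL' marker, and the function names below are hypothetical, not from the text), the difference can be simulated in a few lines:&lt;br /&gt;

```python
# Toy issue-slot simulation contrasting fine-grained and coarse-grained
# multithreading. Each thread is a list of instructions; "STALL" marks a
# long-latency instruction that forces a coarse-grained thread switch.

def fine_grained(threads, cycles):
    # Round-robin: a different thread issues each cycle. Stall latency is
    # not modeled here, since another thread issues on the next cycle anyway.
    trace = []
    ptrs = [0] * len(threads)
    for c in range(cycles):
        t = c % len(threads)
        if ptrs[t] < len(threads[t]):
            trace.append((t, threads[t][ptrs[t]]))
            ptrs[t] += 1
    return trace

def coarse_grained(threads, cycles):
    # Stay on one thread until it stalls or finishes, then switch to the
    # next thread (the switch to an exhausted thread wastes a cycle).
    trace = []
    ptrs = [0] * len(threads)
    t = 0
    for _ in range(cycles):
        if ptrs[t] >= len(threads[t]):
            t = (t + 1) % len(threads)   # thread done: move on, issue nothing
            continue
        op = threads[t][ptrs[t]]
        trace.append((t, op))
        ptrs[t] += 1
        if op == "STALL":                # long-latency op: switch threads
            t = (t + 1) % len(threads)
    return trace
```

Running both on the same two threads shows the fine-grained schedule interleaving every cycle, while the coarse-grained schedule runs each thread until a stall.&lt;br /&gt;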
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-core ===&lt;br /&gt;
[[Image:Smithfield_die_med.jpg|thumb|right|80px|Fig.4 Intel® Pentium® processor Extreme Edition processor die [7]]]&lt;br /&gt;
Multi-core CPUs place multiple CPU cores on a single die. The cores are connected to each other through a shared L2 or L3 cache, or through on-die glue logic such as a switch or bus. Every core on a die shares the interconnect components used to interface with other processors and the rest of the system. These components include the FSB (Front Side Bus), a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The advantages of multi-core chips are power efficiency and simplicity around the processors: since multiple processors are packed into a single die, the glue logic required to connect them is packed onto the die as well. This saves power and simplifies auxiliary circuitry compared to separately packaged processors, which need PCB-level connections.&lt;br /&gt;
The Intel Pentium Extreme Edition, Core Duo, and Core 2 Duo are good examples of multi-core processors.&lt;br /&gt;
The Intel Xeon X7300 series puts four cores in a single package using a 65 nm process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Speculative Execution ===&lt;br /&gt;
In the effort to extract more ILP, managing control dependences becomes both more important and more burdensome. To reduce the cost of stalls caused by branches, branch prediction is applied at the instruction fetch stage. However, a processor that executes multiple instructions per clock needs more than accurate prediction. To speculate is to act on these predictions: fetch and execute instructions from the predicted path.[12]&lt;br /&gt;
&lt;br /&gt;
Under speculative execution, instructions are fetched, issued, and executed as if branch predictions were always correct; a recovery mechanism handles mispredictions. When the processor encounters a branch, it predicts the branch target and follows that path, but it also takes a checkpoint: it saves a copy of state such as the register file, control information, and the alternative branch target. If the prediction is correct, the processor reclaims the checkpoint storage for use by later predicted branches. If the prediction is incorrect, it restores the execution state from the corresponding checkpoint. Examples include the PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, and AMD K5/K6/Athlon. &lt;br /&gt;
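The checkpoint-and-restore mechanism described above can be sketched as follows (the class and method names are hypothetical; a real processor checkpoints rename maps and branch state in hardware, not a Python dictionary):&lt;br /&gt;

```python
import copy

# Minimal sketch of checkpoint-based speculation recovery. On a branch,
# save a copy of architectural state; on a misprediction, restore that
# copy and discard the speculative work; on a correct prediction, simply
# reclaim the checkpoint.

class SpeculativeCore:
    def __init__(self):
        self.regs = {}           # architectural register file
        self.checkpoints = []    # stack of saved states, one per in-flight branch

    def execute(self, reg, value):
        # Write a result (possibly speculatively) into the register file.
        self.regs[reg] = value

    def branch(self):
        # Predict and keep fetching down the predicted path, but
        # checkpoint the current state first.
        self.checkpoints.append(copy.deepcopy(self.regs))

    def resolve(self, correct):
        # Branch outcome is now known: pop the matching checkpoint.
        ckpt = self.checkpoints.pop()
        if not correct:
            self.regs = ckpt     # mispredict: roll back to the checkpoint
```

For instance, a speculative write to a register is discarded when the branch resolves as mispredicted, but becomes permanent when the prediction was correct.&lt;br /&gt;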
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Updated Figure 1.8 &amp;amp; 1.9 ==&lt;br /&gt;
[[Image:fig18.jpg|frame|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.8 of our book has been updated to incorporate trends from 2000 to the present. The SGI Origin 3000 series was refreshed as the Origin 3400 and the Origin 3900 in 2000 and 2003, respectively. Sun introduced enterprise servers even more powerful than the E10000: the E15000 in 2002, and the E20000 and E25000 in 2006. HP's high-end 9000 Superdome servers with 16, 32, and 64 processors were released this year (2007).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:fig19.jpg|frame|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors(Y-axis is log-scaled)]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.9 shows the shared-memory bus bandwidth of the servers introduced in Figure 1.8: the SGI Origin 3000 series and the Sun Enterprise 15K, 20K, and 25K, as well as the IBM p5 590 and the HP 9000 Superdome. In the case of the Sun E25K, the available bandwidth is 43.2 GBps and the aggregate bandwidth exceeds 100 GBps. The Origin 3900 has 12.8 GBps of bus bandwidth and an aggregate bandwidth of 172.8 GBps.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] John L. Hennessy, David A. Patterson, &amp;quot;Computer Architecture: A Quantitative Approach&amp;quot; 3rd Ed., Morgan Kaufmann, CA, USA&lt;br /&gt;
&lt;br /&gt;
[2] CE Kozyrakis, DA Patterson, &amp;quot;A new direction for computer architecture research&amp;quot;, &lt;br /&gt;
Computer Volume 31 Issue 11, IEEE, Nov 1998, pp24-32&lt;br /&gt;
&lt;br /&gt;
[3] K.C. Yeager, &amp;quot;The MIPS R10000 Superscalar Microprocessor&amp;quot;, IEEE Micro Volume 16 Issue 2, Apr. 1996, pp28-41&lt;br /&gt;
&lt;br /&gt;
[4] Geoff Koch, &amp;quot;Discovering Multi-Core: Extending the Benefits of Moore’s Law&amp;quot;, Technology@Intel Magazine, Jul 2005, pp1-6&lt;br /&gt;
&lt;br /&gt;
[5] Richard Low, &amp;quot;Microprocessor trends:multicore, memory, and power developments&amp;quot;, Embedded Computing Design, Sep 2005&lt;br /&gt;
&lt;br /&gt;
[6] Artur Klauser, &amp;quot;Trends in High-Performance Microprocessor Design&amp;quot;, Telematik 1, 2001&lt;br /&gt;
&lt;br /&gt;
[7] http://www.intel.com &amp;amp; http://www.intel.com/pressroom/kits/pentiumee&lt;br /&gt;
&lt;br /&gt;
[8] http://www.alimartech.com/9000_servers.htm&lt;br /&gt;
&lt;br /&gt;
[9] http://www.sun.com/servers/index.jsp?gr0=cpu&amp;amp;fl0=cpu4&amp;amp;gr1=&lt;br /&gt;
&lt;br /&gt;
[10] http://www.sgi.com/pdfs/3867.pdf&lt;br /&gt;
&lt;br /&gt;
[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html&lt;br /&gt;
&lt;br /&gt;
[12] Eric Rotenberg, ECE721 Advanced Microarchitecture lecture notes, NCSU, 2007&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3437</id>
		<title>CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3437"/>
		<updated>2007-09-11T02:03:19Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* Updated Figure 1.8 &amp;amp; 1.9 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Architectural Trends ==&lt;br /&gt;
[[Image:MIPSR10000.jpg|thumb|right|300px|Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])]]&lt;br /&gt;
[[Image:IntelMoorsLaw.jpg|thumb|right|300px|Fig.2 The number of transistors on Intel chips]]&lt;br /&gt;
Feature size means the minimum size of a transistor, or the width of the wires that connect transistors and other circuit components. Feature sizes have decreased dramatically, from 10 microns in 1971 to 0.18 microns in 2001. These advances in integrated-circuit processes allowed the integration of one billion transistors on a single chip and enabled more complex and faster microprocessor architectures, which have evolved in the direction of increasing parallelism: [http://en.wikipedia.org/wiki/Instruction_level_parallelism ILP] and [http://en.wikipedia.org/wiki/Thread_level_parallelism TLP]. With respect to microprocessor architecture, as superscalar processors came to prevail, several additional exploitable architectures were proposed during the past 10 years, as in past decades. Building on the superscalar approach, VLIW, superspeculative, simultaneous multithreading, chip multiprocessor, and other designs were proposed and explored. These techniques attempt to overcome the control and data hazards that deep pipelining and multiple issue aggravate, as well as to maximize computing throughput through TLP.&lt;br /&gt;
&lt;br /&gt;
For example, the MIPS R10000 is a superscalar processor that executes out of order; it packs 6.8 million transistors into a 16.64 mm x 17.934 mm (298 mm^2) die using a 0.35 um process. It fetches 4 instructions simultaneously and has 6 pipelines in total: 5 for execution and 1 for fetching and decoding. The execution pipelines fall into 3 categories: integer, floating-point, and load/store.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VLIW ===&lt;br /&gt;
VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple issue can be attained in two basic ways: superscalar and VLIW. The big difference between superscalar and VLIW lies in how instructions are scheduled. While superscalar processors issue multiple instructions per clock that are scheduled either statically or dynamically, VLIWs issue instructions scheduled statically by the compiler. Both superscalar and VLIW processors have multiple independent functional units.&lt;br /&gt;
&lt;br /&gt;
A VLIW compiler analyzes the program's instructions and groups multiple independent instructions into one large packaged instruction. A VLIW processor issues a fixed number of instructions, formatted either as one long instruction or as a fixed instruction packet with the parallelism made explicit.&lt;br /&gt;
&lt;br /&gt;
To look inside VLIW operation, consider the following example code for MIPS [1].&lt;br /&gt;
&lt;br /&gt;
for (i=1000; i&amp;gt;0; i=i-1)   x[i] = x[i] + s;&lt;br /&gt;
&lt;br /&gt;
The standard MIPS code looks like this:&lt;br /&gt;
&lt;br /&gt;
[[Image:simpleMIPS.jpg]]&lt;br /&gt;
&lt;br /&gt;
If loop-unrolling and scheduling the code are applied, then&lt;br /&gt;
&lt;br /&gt;
[[Image:loopunrollMIPS.jpg]]&lt;br /&gt;
&lt;br /&gt;
It takes 14 cycles for the unrolled loop body.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If VLIW instructions are used, then&lt;br /&gt;
[[Image:VLIW.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The MIPS R10000 is also an example: it has 2 integer functional units and 3 kinds of operands, so the compiler can generate one instruction containing 3 integer operations along with the operands for each operation. Other examples of VLIW designs include the Intel i860 and the Philips TriMedia.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Multi-threading ===&lt;br /&gt;
[[Image:SMTEx.jpg|thumb|right|300px|Fig.3 Four different approaches of using issue slots in superscalar processor (Redrawn from Fig 6.44 of [1])]]&lt;br /&gt;
Multi-threading exploits thread-level parallelism (TLP) within a single processor. It allows multiple threads to share the functional units of a single processor in an overlapping manner. For this sharing, the processor must maintain duplicated state for each thread: its register file, PC, page table, and so on. In addition, the processor must be able to switch between threads quickly. &lt;br /&gt;
&lt;br /&gt;
There are two basic approaches to multi-threading: fine-grained and coarse-grained. The former switches between threads on each instruction, so the execution of multiple threads is interleaved; to support this, the processor can switch threads on every clock cycle. The advantage of this architecture is that it hides stalls, because instructions from other threads can execute when one thread stalls. The disadvantage is that it slows down the execution of an individual thread, because even when a thread's instruction is ready to execute, it may be delayed by instructions from other threads.&lt;br /&gt;
&lt;br /&gt;
The latter switches threads only on costly stalls. This policy avoids unnecessary thread switches, so an individual thread is not slowed down, in contrast to the fine-grained case. However, each switch incurs the cost of refilling the pipeline: this kind of processor issues instructions from a single thread, so when a stall occurs the pipeline empties, and before a new thread can execute in place of the stalled one, the pipeline must be refilled.&lt;br /&gt;
&lt;br /&gt;
Simultaneous multithreading (SMT) is a kind of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP, while at the same time exploiting ILP across the issue slots within a single clock cycle. Fig. 3 compares the three kinds of multi-threading with a plain superscalar processor.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-core ===&lt;br /&gt;
[[Image:Smithfield_die_med.jpg|thumb|right|80px|Fig.4 Intel® Pentium® processor Extreme Edition processor die [7]]]&lt;br /&gt;
Multi-core CPUs place multiple CPU cores on a single die. The cores are connected to each other through a shared L2 or L3 cache, or through on-die glue logic such as a switch or bus. Every core on a die shares the interconnect components used to interface with other processors and the rest of the system. These components include the FSB (Front Side Bus), a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The advantages of multi-core chips are power efficiency and simplicity around the processors: since multiple processors are packed into a single die, the glue logic required to connect them is packed onto the die as well. This saves power and simplifies auxiliary circuitry compared to separately packaged processors, which need PCB-level connections.&lt;br /&gt;
The Intel Pentium Extreme Edition, Core Duo, and Core 2 Duo are good examples of multi-core processors.&lt;br /&gt;
The Intel Xeon X7300 series puts four cores in a single package using a 65 nm process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Speculative Execution ===&lt;br /&gt;
In the effort to extract more ILP, managing control dependences becomes both more important and more burdensome. To reduce the cost of stalls caused by branches, branch prediction is applied at the instruction fetch stage. However, a processor that executes multiple instructions per clock needs more than accurate prediction. To speculate is to act on these predictions: fetch and execute instructions from the predicted path.[]&lt;br /&gt;
&lt;br /&gt;
Under speculative execution, instructions are fetched, issued, and executed as if branch predictions were always correct; a recovery mechanism handles mispredictions. When the processor encounters a branch, it predicts the branch target and follows that path, but it also takes a checkpoint: it saves a copy of state such as the register file, control information, and the alternative branch target. If the prediction is correct, the processor reclaims the checkpoint storage for use by later predicted branches. If the prediction is incorrect, it restores the execution state from the corresponding checkpoint. Examples include the PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, and AMD K5/K6/Athlon. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Updated Figure 1.8 &amp;amp; 1.9 ==&lt;br /&gt;
[[Image:fig18.jpg|frame|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.8 of our book has been updated to incorporate trends from 2000 to the present. The SGI Origin 3000 series was refreshed as the Origin 3400 and the Origin 3900 in 2000 and 2003, respectively. Sun introduced enterprise servers even more powerful than the E10000: the E15000 in 2002, and the E20000 and E25000 in 2006. HP's high-end 9000 Superdome servers with 16, 32, and 64 processors were released this year (2007).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:fig19.jpg|frame|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors(Y-axis is log-scaled)]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.9 shows the shared-memory bus bandwidth of the servers introduced in Figure 1.8: the SGI Origin 3000 series and the Sun Enterprise 15K, 20K, and 25K, as well as the IBM p5 590 and the HP 9000 Superdome. In the case of the Sun E25K, the available bandwidth is 43.2 GBps and the aggregate bandwidth exceeds 100 GBps. The Origin 3900 has 12.8 GBps of bus bandwidth and an aggregate bandwidth of 172.8 GBps.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] John L. Hennessy, David A. Patterson, &amp;quot;Computer Architecture: A Quantitative Approach&amp;quot; 3rd Ed., Morgan Kaufmann, CA, USA&lt;br /&gt;
&lt;br /&gt;
[2] CE Kozyrakis, DA Patterson, &amp;quot;A new direction for computer architecture research&amp;quot;, &lt;br /&gt;
Computer Volume 31 Issue 11, IEEE, Nov 1998, pp24-32&lt;br /&gt;
&lt;br /&gt;
[3] K.C. Yeager, &amp;quot;The MIPS R10000 Superscalar Microprocessor&amp;quot;, IEEE Micro Volume 16 Issue 2, Apr. 1996, pp28-41&lt;br /&gt;
&lt;br /&gt;
[4] Geoff Koch, &amp;quot;Discovering Multi-Core: Extending the Benefits of Moore’s Law&amp;quot;, Technology@Intel Magazine, Jul 2005, pp1-6&lt;br /&gt;
&lt;br /&gt;
[5] Richard Low, &amp;quot;Microprocessor trends:multicore, memory, and power developments&amp;quot;, Embedded Computing Design, Sep 2005&lt;br /&gt;
&lt;br /&gt;
[6] Artur Klauser, &amp;quot;Trends in High-Performance Microprocessor Design&amp;quot;, Telematik 1, 2001&lt;br /&gt;
&lt;br /&gt;
[7] http://www.intel.com &amp;amp; http://www.intel.com/pressroom/kits/pentiumee&lt;br /&gt;
&lt;br /&gt;
[8] http://www.alimartech.com/9000_servers.htm&lt;br /&gt;
&lt;br /&gt;
[9] http://www.sun.com/servers/index.jsp?gr0=cpu&amp;amp;fl0=cpu4&amp;amp;gr1=&lt;br /&gt;
&lt;br /&gt;
[10] http://www.sgi.com/pdfs/3867.pdf&lt;br /&gt;
&lt;br /&gt;
[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3434</id>
		<title>CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3434"/>
		<updated>2007-09-11T02:02:19Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* VLIW */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Architectural Trends ==&lt;br /&gt;
[[Image:MIPSR10000.jpg|thumb|right|300px|Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])]]&lt;br /&gt;
[[Image:IntelMoorsLaw.jpg|thumb|right|300px|Fig.2 The number of transistors on Intel chips]]&lt;br /&gt;
Feature size means the minimum size of a transistor, or the width of the wires that connect transistors and other circuit components. Feature sizes have decreased dramatically, from 10 microns in 1971 to 0.18 microns in 2001. These advances in integrated-circuit processes allowed the integration of one billion transistors on a single chip and enabled more complex and faster microprocessor architectures, which have evolved in the direction of increasing parallelism: [http://en.wikipedia.org/wiki/Instruction_level_parallelism ILP] and [http://en.wikipedia.org/wiki/Thread_level_parallelism TLP]. With respect to microprocessor architecture, as superscalar processors came to prevail, several additional exploitable architectures were proposed during the past 10 years, as in past decades. Building on the superscalar approach, VLIW, superspeculative, simultaneous multithreading, chip multiprocessor, and other designs were proposed and explored. These techniques attempt to overcome the control and data hazards that deep pipelining and multiple issue aggravate, as well as to maximize computing throughput through TLP.&lt;br /&gt;
&lt;br /&gt;
For example, the MIPS R10000 is a superscalar processor that executes out of order; it packs 6.8 million transistors into a 16.64 mm x 17.934 mm (298 mm^2) die using a 0.35 um process. It fetches 4 instructions simultaneously and has 6 pipelines in total: 5 for execution and 1 for fetching and decoding. The execution pipelines fall into 3 categories: integer, floating-point, and load/store.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VLIW ===&lt;br /&gt;
VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple issue can be attained in two basic ways: superscalar and VLIW. The big difference between superscalar and VLIW lies in how instructions are scheduled. While superscalar processors issue multiple instructions per clock that are scheduled either statically or dynamically, VLIWs issue instructions scheduled statically by the compiler. Both superscalar and VLIW processors have multiple independent functional units.&lt;br /&gt;
&lt;br /&gt;
A VLIW compiler analyzes the program's instructions and groups multiple independent instructions into one large packaged instruction. A VLIW processor issues a fixed number of instructions, formatted either as one long instruction or as a fixed instruction packet with the parallelism made explicit.&lt;br /&gt;
&lt;br /&gt;
To look inside VLIW operation, consider the following example code for MIPS [1].&lt;br /&gt;
&lt;br /&gt;
for (i=1000; i&amp;gt;0; i=i-1)   x[i] = x[i] + s;&lt;br /&gt;
&lt;br /&gt;
The standard MIPS code looks like this:&lt;br /&gt;
&lt;br /&gt;
[[Image:simpleMIPS.jpg]]&lt;br /&gt;
&lt;br /&gt;
If loop-unrolling and scheduling the code are applied, then&lt;br /&gt;
&lt;br /&gt;
[[Image:loopunrollMIPS.jpg]]&lt;br /&gt;
&lt;br /&gt;
It takes 14 cycles for the unrolled loop body.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If VLIW instructions are used, then&lt;br /&gt;
[[Image:VLIW.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The MIPS R10000 is also an example: it has 2 integer functional units and 3 kinds of operands, so the compiler can generate one instruction containing 3 integer operations along with the operands for each operation. Other examples of VLIW designs include the Intel i860 and the Philips TriMedia.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Multi-threading ===&lt;br /&gt;
[[Image:SMTEx.jpg|thumb|right|300px|Fig.3 Four different approaches of using issue slots in superscalar processor (Redrawn from Fig 6.44 of [1])]]&lt;br /&gt;
Multi-threading enables exploiting thread-level parallelism(TLP) within a single processor. It allows multiple threads to share the functional units of a single processor by an overlapping manner. For this sharing, the processor has to maintain the duplicated state information of each thread-register file, PC, page table and so on. In addition, the processor can switch the different thread enough quickly. &lt;br /&gt;
&lt;br /&gt;
For attaining multi-threading, there are two basic approaches; fine-garained multi-threading and coarse-grained multi-threading. The former switches each instruction between multiple interleaved threads. For this interleaving, the processor can switch threads on every clock cycle. The advantage of this architecture can prohibit stalling, because other instruction from other threads can be performed when one thread stalls. The disadvantage makes slow down the individual thread's execution, because even though the instruction is ready to be executed, it can be interleaved by another thread's instruction.&lt;br /&gt;
&lt;br /&gt;
The latter switches threads only on costly stalls. This policy reduces unnecessary thread switching, so an individual thread's execution is not slowed down as in the fine-grained case. However, each switch incurs the cost of refilling the pipeline: such a processor issues instructions from a single thread, so when that thread stalls the pipeline empties, and executing a new thread in its place requires the pipeline to be filled again.&lt;br /&gt;
&lt;br /&gt;
Simultaneous multithreading (SMT) is a form of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP while at the same time exploiting ILP across the issue slots of a single clock cycle. Fig. 3 compares three kinds of multi-threading with a conventional superscalar processor.&lt;br /&gt;
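The switching policies above can be illustrated with a toy trace model (a hypothetical Python sketch; the function names and the encoding of a stall as None are inventions for illustration, not hardware behavior):

```python
# Toy trace model of the two switching policies described above.
# Each thread is a list of issue slots: an instruction label, or
# None to mark a stall cycle. (Illustrative only.)

def fine_grained(threads, cycles):
    # Switch threads every clock cycle, round-robin. A stalled or
    # finished thread wastes only its own slot; the other threads
    # keep the pipeline busy.
    trace, pcs = [], [0] * len(threads)
    for c in range(cycles):
        t = c % len(threads)
        if pcs[t] == len(threads[t]):
            trace.append("idle")
        else:
            trace.append(threads[t][pcs[t]])  # may be None (a stall)
            pcs[t] += 1
    return trace

def coarse_grained(threads, cycles, penalty=2):
    # Stay on one thread until it stalls, then switch and pay
    # 'penalty' cycles to refill the pipeline.
    trace, pcs = [], [0] * len(threads)
    t, refill = 0, 0
    for _ in range(cycles):
        if refill:
            trace.append("refill")
            refill -= 1
        elif pcs[t] == len(threads[t]) or threads[t][pcs[t]] is None:
            if pcs[t] != len(threads[t]):
                pcs[t] += 1                   # consume the stall marker
            t = (t + 1) % len(threads)        # switch thread
            trace.append("refill")
            refill = penalty - 1
        else:
            trace.append(threads[t][pcs[t]])
            pcs[t] += 1
    return trace
```

In the fine-grained trace a stall costs only the stalled thread's own slot, while the coarse-grained trace runs one thread at full speed but pays refill cycles on every switch, matching the trade-off described above.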
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-core ===&lt;br /&gt;
[[Image:Smithfield_die_med.jpg|thumb|right|80px|Fig.4 Intel® Pentium® processor Extreme Edition processor die [7]]]&lt;br /&gt;
Multi-core CPUs have multiple CPU cores on a single die, connected to each other through a shared L2 or L3 cache or through on-die glue logic such as a switch or bus. The cores on a die share the interconnect components that interface to other processors and the rest of the system: a FSB (Front Side Bus), a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The advantages of multi-core chips are power efficiency and simplicity around the processors: since multiple cores are packed into a single die, the glue logic required to connect them is packed in as well, which saves power and simplifies auxiliary circuitry compared with discrete multi-chip designs that need board-level (PCB) interconnect.&lt;br /&gt;
The Intel Pentium Extreme Edition, Core Duo, and Core 2 Duo are good examples of multi-core processors.&lt;br /&gt;
The Intel Xeon X7300 series puts four cores in one package using a 65nm process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Speculative Execution ===&lt;br /&gt;
As processors try to extract more ILP, managing control dependences becomes more important and more burdensome. To reduce the cost of branch stalls, branch prediction is applied at the instruction-fetch stage. However, for a processor that executes multiple instructions per clock, more than accurate prediction is required. To speculate is to act on these predictions: fetch and execute instructions from the predicted path.[]&lt;br /&gt;
&lt;br /&gt;
Under speculative execution, instructions are fetched, issued, and executed as if branch predictions were always correct; a recovery mechanism handles mispredictions. When the processor meets a branch, it predicts the branch target, follows that path, and takes a checkpoint. While checkpointing, the processor copies state information such as the register file, control information, and the alternative branch target. If the prediction is correct, the processor reclaims the stored information for use by newly predicted branches; if the prediction is incorrect, it restores the execution state from the corresponding checkpoint. Examples include the PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, and AMD K5/K6/Athlon. &lt;br /&gt;
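A minimal sketch of the checkpoint-and-recover idea (hypothetical Python; real processors checkpoint rename maps and control state in hardware, not dictionaries, and every name here is illustrative):

```python
# Minimal sketch of checkpoint-based speculative execution as
# described above. Architectural state is modeled as a register
# dictionary; each path is a list of (register, value) writes.

def execute_branch(regs, predict_taken, actually_taken,
                   taken_path, fallthrough_path):
    checkpoint = dict(regs)          # snapshot state at the branch
    # Speculate: execute the predicted path immediately.
    path = taken_path if predict_taken else fallthrough_path
    for reg, value in path:
        regs[reg] = value
    if predict_taken == actually_taken:
        # Prediction correct: keep the speculative results; the
        # checkpoint storage is reclaimed for later branches.
        return regs
    # Misprediction: restore state from the checkpoint, then
    # execute the correct path.
    regs.clear()
    regs.update(checkpoint)
    correct = taken_path if actually_taken else fallthrough_path
    for reg, value in correct:
        regs[reg] = value
    return regs
```

On a correct prediction the speculative writes survive; on a misprediction the register state rolls back to the checkpoint before the correct path runs, which is the recovery mechanism the text describes.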
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Updated Figure 1.8 &amp;amp; 1.9 ==&lt;br /&gt;
[[Image:fig18.jpg|frame|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Figure 1.8 of our book has been updated to incorporate trends from 2000 to the present. The SGI Origin 3000 series was reintroduced as the Origin 3400 and the Origin 3900 in 2000 and 2003, respectively. Sun introduced enterprise servers even more powerful than the E10000: the E15000 in 2002 and the E20000 and E25000 in 2006. HP's high-end 9000 Superdome, with 16, 32, and 64 processors, was released this year (2007).&lt;br /&gt;
[[Image:fig19.jpg|frame|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors(Y-axis is log-scaled)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Figure 1.9 shows the shared-memory bus bandwidth of the servers introduced in Figure 1.8: the SGI Origin 3000 series; the Sun Enterprise 15K, 20K, and 25K; the IBM p5 590; and the HP 9000 Superdome. For the Sun E25K, the available bandwidth is 43.2 GBps and the aggregate bandwidth exceeds 100 GBps. The Origin 3900 has 12.8 GBps of bandwidth and an aggregate bandwidth of 172.8 GBps.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] John L. Hennessy, David A. Patterson, &amp;quot;Computer Architecture: A Quantitative Approach&amp;quot; 3rd Ed., Morgan Kaufmann, CA, USA&lt;br /&gt;
&lt;br /&gt;
[2] CE Kozyrakis, DA Patterson, &amp;quot;A new direction for computer architecture research&amp;quot;, &lt;br /&gt;
Computer Volume 31 Issue 11, IEEE, Nov 1998, pp24-32&lt;br /&gt;
&lt;br /&gt;
[3] K.C. Yeager, &amp;quot;The MIPS R10000 Superscalar Microprocessor&amp;quot;, IEEE Micro Volume 16 Issue 2, Apr. 1996, pp28-41&lt;br /&gt;
&lt;br /&gt;
[4] Geoff Koch, &amp;quot;Discovering Multi-Core: Extending the Benefits of Moore’s Law&amp;quot;, Technology@Intel Magazine, Jul 2005, pp1-6&lt;br /&gt;
&lt;br /&gt;
[5] Richard Low, &amp;quot;Microprocessor trends:multicore, memory, and power developments&amp;quot;, Embedded Computing Design, Sep 2005&lt;br /&gt;
&lt;br /&gt;
[6] Artur Klauser, &amp;quot;Trends in High-Performance Microprocessor Design&amp;quot;, Telematik 1, 2001&lt;br /&gt;
&lt;br /&gt;
[7] http://www.intel.com &amp;amp; http://www.intel.com/pressroom/kits/pentiumee&lt;br /&gt;
&lt;br /&gt;
[8] http://www.alimartech.com/9000_servers.htm&lt;br /&gt;
&lt;br /&gt;
[9] http://www.sun.com/servers/index.jsp?gr0=cpu&amp;amp;fl0=cpu4&amp;amp;gr1=&lt;br /&gt;
&lt;br /&gt;
[10] http://www.sgi.com/pdfs/3867.pdf&lt;br /&gt;
&lt;br /&gt;
[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3432</id>
		<title>CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3432"/>
		<updated>2007-09-11T02:01:57Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* VLIW */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Architectural Trends ==&lt;br /&gt;
[[Image:MIPSR10000.jpg|thumb|right|300px|Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])]]&lt;br /&gt;
[[Image:IntelMoorsLaw.jpg|thumb|right|300px|Fig.2 The number of transistors on Intel chips]]&lt;br /&gt;
Feature size means the minimum size of a transistor or of the wires connecting transistors and other circuit components. Feature sizes have decreased dramatically, from 10 microns in 1971 to 0.18 microns in 2001. These advances in integrated-circuit processes allowed the integration of one billion transistors on a single chip and enabled more complicated and faster microprocessor architectures, which have evolved in the direction of increasing parallelism: [http://en.wikipedia.org/wiki/Instruction_level_parallelism ILP] and [http://en.wikipedia.org/wiki/Thread_level_parallelism TLP]. With respect to microprocessor architecture, as superscalar processors came to prevail, several additional exploitable architectures were proposed during the past 10 years, as in past decades. Building on the superscalar architecture, VLIW, superspeculative execution, simultaneous multithreading, chip multiprocessors, and so on were proposed and explored. These techniques try to overcome control and data hazards, which grow with deep pipelining and multiple issue, as well as to maximize computing throughput via TLP.&lt;br /&gt;
&lt;br /&gt;
For example, the MIPS R10000 is a superscalar processor with out-of-order execution; it has 6.8 million transistors on a 16.64mm x 17.934mm (298mm^2) die fabricated in a 0.35um process. It fetches 4 instructions simultaneously and has 6 pipelines in total: 5 pipelines for execution and 1 pipeline for fetching and decoding. The execution pipelines fall into 3 categories - integer, floating-point, and load/store.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VLIW ===&lt;br /&gt;
VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple issue can be attained in two basic ways: superscalar and VLIW. The big difference between them lies in how instructions are scheduled. While superscalar processors issue multiple instructions per clock, scheduled either statically or dynamically, VLIW processors issue instructions statically scheduled by the compiler. Both superscalar and VLIW designs have multiple, independent functional units.&lt;br /&gt;
&lt;br /&gt;
A VLIW compiler analyzes the programmer's instructions and groups multiple independent instructions into one large packaged instruction. A VLIW processor issues a fixed number of operations, formatted either as one long instruction or as a fixed instruction packet, with the parallelism made explicit.&lt;br /&gt;
&lt;br /&gt;
To look inside VLIW operation, consider the example MIPS code below [1].&lt;br /&gt;
&lt;br /&gt;
for (i=1000; i&amp;gt;0; i=i-1)   x[i] = x[i] + s;&lt;br /&gt;
&lt;br /&gt;
The standard MIPS code looks like this:&lt;br /&gt;
&lt;br /&gt;
[[Image:simpleMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
If loop-unrolling and scheduling the code are applied, then&lt;br /&gt;
&lt;br /&gt;
[[Image:loopunrollMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The unrolled and scheduled loop body takes 14 cycles, or 3.5 cycles per element.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If VLIW instructions are used, then&lt;br /&gt;
[[Image:VLIW.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The MIPS R10000 is another example: it has 2 integer functional units and 3 kinds of operands, so the compiler can generate one instruction containing 3 integer operations along with the operands corresponding to each operation. Other examples of VLIW designs are the Intel i860 and the Philips TriMedia.&lt;br /&gt;
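The compiler-side grouping described in this section can be sketched as a greedy, in-order bundle packer (an illustrative Python sketch; the op encoding and the 3-wide format are assumptions for illustration, not a real VLIW ISA):

```python
# Greedy, in-order sketch of VLIW bundle packing. Each op is
# (name, dest, sources); ops in one bundle must be mutually
# independent, since they issue together in one long instruction.

def pack_bundles(ops, width=3):
    bundles, i = [], 0
    while i != len(ops):
        bundle, written = [], set()
        while i != len(ops) and len(bundle) != width:
            name, dest, srcs = ops[i]
            # A RAW or WAW dependence on this bundle ends it.
            if dest in written or not written.isdisjoint(srcs):
                break
            bundle.append(name)
            written.add(dest)
            i += 1
        bundles.append(bundle)
    return bundles
```

For example, two independent loads pack into one bundle, but an add of their results and a store of the sum each start a new bundle, because each consumes a value produced just before it.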
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Multi-threading ===&lt;br /&gt;
[[Image:SMTEx.jpg|thumb|right|300px|Fig.3 Four different approaches to using issue slots in a superscalar processor (Redrawn from Fig 6.44 of [1])]]&lt;br /&gt;
Multi-threading exploits thread-level parallelism (TLP) within a single processor. It allows multiple threads to share the functional units of a single processor in an overlapping manner. For this sharing, the processor has to maintain a duplicate of each thread's state - register file, PC, page table, and so on. In addition, the processor must be able to switch between threads quickly enough. &lt;br /&gt;
&lt;br /&gt;
There are two basic approaches to multi-threading: fine-grained and coarse-grained. The former switches between threads on each instruction, interleaving multiple threads; to support this, the processor can switch threads on every clock cycle. The advantage is that it can hide stalls: when one thread stalls, instructions from the other threads can still execute. The disadvantage is that it slows down each individual thread's execution, because an instruction that is ready to execute still has to wait its turn behind other threads' instructions.&lt;br /&gt;
&lt;br /&gt;
The latter switches threads only on costly stalls. This policy reduces unnecessary thread switching, so an individual thread's execution is not slowed down as in the fine-grained case. However, each switch incurs the cost of refilling the pipeline: such a processor issues instructions from a single thread, so when that thread stalls the pipeline empties, and executing a new thread in its place requires the pipeline to be filled again.&lt;br /&gt;
&lt;br /&gt;
Simultaneous multithreading (SMT) is a form of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP while at the same time exploiting ILP across the issue slots of a single clock cycle. Fig. 3 compares three kinds of multi-threading with a conventional superscalar processor.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-core ===&lt;br /&gt;
[[Image:Smithfield_die_med.jpg|thumb|right|80px|Fig.4 Intel® Pentium® processor Extreme Edition processor die [7]]]&lt;br /&gt;
Multi-core CPUs have multiple CPU cores on a single die, connected to each other through a shared L2 or L3 cache or through on-die glue logic such as a switch or bus. The cores on a die share the interconnect components that interface to other processors and the rest of the system: a FSB (Front Side Bus), a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The advantages of multi-core chips are power efficiency and simplicity around the processors: since multiple cores are packed into a single die, the glue logic required to connect them is packed in as well, which saves power and simplifies auxiliary circuitry compared with discrete multi-chip designs that need board-level (PCB) interconnect.&lt;br /&gt;
The Intel Pentium Extreme Edition, Core Duo, and Core 2 Duo are good examples of multi-core processors.&lt;br /&gt;
The Intel Xeon X7300 series puts four cores in one package using a 65nm process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Speculative Execution ===&lt;br /&gt;
As processors try to extract more ILP, managing control dependences becomes more important and more burdensome. To reduce the cost of branch stalls, branch prediction is applied at the instruction-fetch stage. However, for a processor that executes multiple instructions per clock, more than accurate prediction is required. To speculate is to act on these predictions: fetch and execute instructions from the predicted path.[]&lt;br /&gt;
&lt;br /&gt;
Under speculative execution, instructions are fetched, issued, and executed as if branch predictions were always correct; a recovery mechanism handles mispredictions. When the processor meets a branch, it predicts the branch target, follows that path, and takes a checkpoint. While checkpointing, the processor copies state information such as the register file, control information, and the alternative branch target. If the prediction is correct, the processor reclaims the stored information for use by newly predicted branches; if the prediction is incorrect, it restores the execution state from the corresponding checkpoint. Examples include the PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, and AMD K5/K6/Athlon. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Updated Figure 1.8 &amp;amp; 1.9 ==&lt;br /&gt;
[[Image:fig18.jpg|frame|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Figure 1.8 of our book has been updated to incorporate trends from 2000 to the present. The SGI Origin 3000 series was reintroduced as the Origin 3400 and the Origin 3900 in 2000 and 2003, respectively. Sun introduced enterprise servers even more powerful than the E10000: the E15000 in 2002 and the E20000 and E25000 in 2006. HP's high-end 9000 Superdome, with 16, 32, and 64 processors, was released this year (2007).&lt;br /&gt;
[[Image:fig19.jpg|frame|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors(Y-axis is log-scaled)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Figure 1.9 shows the shared-memory bus bandwidth of the servers introduced in Figure 1.8: the SGI Origin 3000 series; the Sun Enterprise 15K, 20K, and 25K; the IBM p5 590; and the HP 9000 Superdome. For the Sun E25K, the available bandwidth is 43.2 GBps and the aggregate bandwidth exceeds 100 GBps. The Origin 3900 has 12.8 GBps of bandwidth and an aggregate bandwidth of 172.8 GBps.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] John L. Hennessy, David A. Patterson, &amp;quot;Computer Architecture: A Quantitative Approach&amp;quot; 3rd Ed., Morgan Kaufmann, CA, USA&lt;br /&gt;
&lt;br /&gt;
[2] CE Kozyrakis, DA Patterson, &amp;quot;A new direction for computer architecture research&amp;quot;, &lt;br /&gt;
Computer Volume 31 Issue 11, IEEE, Nov 1998, pp24-32&lt;br /&gt;
&lt;br /&gt;
[3] K.C. Yeager, &amp;quot;The MIPS R10000 Superscalar Microprocessor&amp;quot;, IEEE Micro Volume 16 Issue 2, Apr. 1996, pp28-41&lt;br /&gt;
&lt;br /&gt;
[4] Geoff Koch, &amp;quot;Discovering Multi-Core: Extending the Benefits of Moore’s Law&amp;quot;, Technology@Intel Magazine, Jul 2005, pp1-6&lt;br /&gt;
&lt;br /&gt;
[5] Richard Low, &amp;quot;Microprocessor trends:multicore, memory, and power developments&amp;quot;, Embedded Computing Design, Sep 2005&lt;br /&gt;
&lt;br /&gt;
[6] Artur Klauser, &amp;quot;Trends in High-Performance Microprocessor Design&amp;quot;, Telematik 1, 2001&lt;br /&gt;
&lt;br /&gt;
[7] http://www.intel.com &amp;amp; http://www.intel.com/pressroom/kits/pentiumee&lt;br /&gt;
&lt;br /&gt;
[8] http://www.alimartech.com/9000_servers.htm&lt;br /&gt;
&lt;br /&gt;
[9] http://www.sun.com/servers/index.jsp?gr0=cpu&amp;amp;fl0=cpu4&amp;amp;gr1=&lt;br /&gt;
&lt;br /&gt;
[10] http://www.sgi.com/pdfs/3867.pdf&lt;br /&gt;
&lt;br /&gt;
[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3431</id>
		<title>CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3431"/>
		<updated>2007-09-11T02:01:37Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* VLIW */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Architectural Trends ==&lt;br /&gt;
[[Image:MIPSR10000.jpg|thumb|right|300px|Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])]]&lt;br /&gt;
[[Image:IntelMoorsLaw.jpg|thumb|right|300px|Fig.2 The number of transistors on Intel chips]]&lt;br /&gt;
Feature size means the minimum size of a transistor or of the wires connecting transistors and other circuit components. Feature sizes have decreased dramatically, from 10 microns in 1971 to 0.18 microns in 2001. These advances in integrated-circuit processes allowed the integration of one billion transistors on a single chip and enabled more complicated and faster microprocessor architectures, which have evolved in the direction of increasing parallelism: [http://en.wikipedia.org/wiki/Instruction_level_parallelism ILP] and [http://en.wikipedia.org/wiki/Thread_level_parallelism TLP]. With respect to microprocessor architecture, as superscalar processors came to prevail, several additional exploitable architectures were proposed during the past 10 years, as in past decades. Building on the superscalar architecture, VLIW, superspeculative execution, simultaneous multithreading, chip multiprocessors, and so on were proposed and explored. These techniques try to overcome control and data hazards, which grow with deep pipelining and multiple issue, as well as to maximize computing throughput via TLP.&lt;br /&gt;
&lt;br /&gt;
For example, the MIPS R10000 is a superscalar processor with out-of-order execution; it has 6.8 million transistors on a 16.64mm x 17.934mm (298mm^2) die fabricated in a 0.35um process. It fetches 4 instructions simultaneously and has 6 pipelines in total: 5 pipelines for execution and 1 pipeline for fetching and decoding. The execution pipelines fall into 3 categories - integer, floating-point, and load/store.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VLIW ===&lt;br /&gt;
VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple issue can be attained in two basic ways: superscalar and VLIW. The big difference between them lies in how instructions are scheduled. While superscalar processors issue multiple instructions per clock, scheduled either statically or dynamically, VLIW processors issue instructions statically scheduled by the compiler. Both superscalar and VLIW designs have multiple, independent functional units.&lt;br /&gt;
&lt;br /&gt;
A VLIW compiler analyzes the programmer's instructions and groups multiple independent instructions into one large packaged instruction. A VLIW processor issues a fixed number of operations, formatted either as one long instruction or as a fixed instruction packet, with the parallelism made explicit.&lt;br /&gt;
&lt;br /&gt;
To look inside VLIW operation, consider the example MIPS code below [1].&lt;br /&gt;
&lt;br /&gt;
for (i=1000; i&amp;gt;0; i=i-1)   x[i] = x[i] + s;&lt;br /&gt;
&lt;br /&gt;
The standard MIPS code looks like this:&lt;br /&gt;
&lt;br /&gt;
[[Image:simpleMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If loop-unrolling and scheduling the code are applied, then&lt;br /&gt;
&lt;br /&gt;
[[Image:loopunrollMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The unrolled and scheduled loop body takes 14 cycles, or 3.5 cycles per element.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If VLIW instructions are used, then&lt;br /&gt;
[[Image:VLIW.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The MIPS R10000 is another example: it has 2 integer functional units and 3 kinds of operands, so the compiler can generate one instruction containing 3 integer operations along with the operands corresponding to each operation. Other examples of VLIW designs are the Intel i860 and the Philips TriMedia.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Multi-threading ===&lt;br /&gt;
[[Image:SMTEx.jpg|thumb|right|300px|Fig.3 Four different approaches to using issue slots in a superscalar processor (Redrawn from Fig 6.44 of [1])]]&lt;br /&gt;
Multi-threading exploits thread-level parallelism (TLP) within a single processor. It allows multiple threads to share the functional units of a single processor in an overlapping manner. For this sharing, the processor has to maintain a duplicate of each thread's state - register file, PC, page table, and so on. In addition, the processor must be able to switch between threads quickly enough. &lt;br /&gt;
&lt;br /&gt;
There are two basic approaches to multi-threading: fine-grained and coarse-grained. The former switches between threads on each instruction, interleaving multiple threads; to support this, the processor can switch threads on every clock cycle. The advantage is that it can hide stalls: when one thread stalls, instructions from the other threads can still execute. The disadvantage is that it slows down each individual thread's execution, because an instruction that is ready to execute still has to wait its turn behind other threads' instructions.&lt;br /&gt;
&lt;br /&gt;
The latter switches threads only on costly stalls. This policy reduces unnecessary thread switching, so an individual thread's execution is not slowed down as in the fine-grained case. However, each switch incurs the cost of refilling the pipeline: such a processor issues instructions from a single thread, so when that thread stalls the pipeline empties, and executing a new thread in its place requires the pipeline to be filled again.&lt;br /&gt;
&lt;br /&gt;
Simultaneous multithreading (SMT) is a form of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP while at the same time exploiting ILP across the issue slots of a single clock cycle. Fig. 3 compares three kinds of multi-threading with a conventional superscalar processor.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-core ===&lt;br /&gt;
[[Image:Smithfield_die_med.jpg|thumb|right|80px|Fig.4 Intel® Pentium® processor Extreme Edition processor die [7]]]&lt;br /&gt;
Multi-core CPUs have multiple CPU cores on a single die, connected to each other through a shared L2 or L3 cache or through on-die glue logic such as a switch or bus. The cores on a die share the interconnect components that interface to other processors and the rest of the system: a FSB (Front Side Bus), a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The advantages of multi-core chips are power efficiency and simplicity around the processors: since multiple cores are packed into a single die, the glue logic required to connect them is packed in as well, which saves power and simplifies auxiliary circuitry compared with discrete multi-chip designs that need board-level (PCB) interconnect.&lt;br /&gt;
The Intel Pentium Extreme Edition, Core Duo, and Core 2 Duo are good examples of multi-core processors.&lt;br /&gt;
The Intel Xeon X7300 series puts four cores in one package using a 65nm process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Speculative Execution ===&lt;br /&gt;
As processors try to extract more ILP, managing control dependences becomes more important and more burdensome. To reduce the cost of branch stalls, branch prediction is applied at the instruction-fetch stage. However, for a processor that executes multiple instructions per clock, more than accurate prediction is required. To speculate is to act on these predictions: fetch and execute instructions from the predicted path.[]&lt;br /&gt;
&lt;br /&gt;
Under speculative execution, instructions are fetched, issued, and executed as if branch predictions were always correct; a recovery mechanism handles mispredictions. When the processor meets a branch, it predicts the branch target, follows that path, and takes a checkpoint. While checkpointing, the processor copies state information such as the register file, control information, and the alternative branch target. If the prediction is correct, the processor reclaims the stored information for use by newly predicted branches; if the prediction is incorrect, it restores the execution state from the corresponding checkpoint. Examples include the PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, and AMD K5/K6/Athlon. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Updated Figure 1.8 &amp;amp; 1.9 ==&lt;br /&gt;
[[Image:fig18.jpg|frame|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Figure 1.8 of our book has been updated to incorporate trends from 2000 to the present. The SGI Origin 3000 series was reintroduced as the Origin 3400 and the Origin 3900 in 2000 and 2003, respectively. Sun introduced enterprise servers even more powerful than the E10000: the E15000 in 2002 and the E20000 and E25000 in 2006. HP's high-end 9000 Superdome, with 16, 32, and 64 processors, was released this year (2007).&lt;br /&gt;
[[Image:fig19.jpg|frame|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors(Y-axis is log-scaled)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Figure 1.9 shows the shared-memory bus bandwidth of the servers introduced in Figure 1.8: the SGI Origin 3000 series; the Sun Enterprise 15K, 20K, and 25K; the IBM p5 590; and the HP 9000 Superdome. For the Sun E25K, the available bandwidth is 43.2 GBps and the aggregate bandwidth exceeds 100 GBps. The Origin 3900 has 12.8 GBps of bandwidth and an aggregate bandwidth of 172.8 GBps.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] John L. Hennessy, David A. Patterson, &amp;quot;Computer Architecture: A Quantitative Approach&amp;quot; 3rd Ed., Morgan Kaufmann, CA, USA&lt;br /&gt;
&lt;br /&gt;
[2] CE Kozyrakis, DA Patterson, &amp;quot;A new direction for computer architecture research&amp;quot;, &lt;br /&gt;
Computer Volume 31 Issue 11, IEEE, Nov 1998, pp24-32&lt;br /&gt;
&lt;br /&gt;
[3] K.C. Yeager, &amp;quot;The MIPS R10000 Superscalar Microprocessor&amp;quot;, IEEE Micro Volume 16 Issue 2, Apr. 1996, pp28-41&lt;br /&gt;
&lt;br /&gt;
[4] Geoff Koch, &amp;quot;Discovering Multi-Core: Extending the Benefits of Moore’s Law&amp;quot;, Technology@Intel Magazine, Jul 2005, pp1-6&lt;br /&gt;
&lt;br /&gt;
[5] Richard Low, &amp;quot;Microprocessor trends: multicore, memory, and power developments&amp;quot;, Embedded Computing Design, Sep 2005&lt;br /&gt;
&lt;br /&gt;
[6] Artur Klauser, &amp;quot;Trends in High-Performance Microprocessor Design&amp;quot;, Telematik 1, 2001&lt;br /&gt;
&lt;br /&gt;
[7] http://www.intel.com &amp;amp; http://www.intel.com/pressroom/kits/pentiumee&lt;br /&gt;
&lt;br /&gt;
[8] http://www.alimartech.com/9000_servers.htm&lt;br /&gt;
&lt;br /&gt;
[9] http://www.sun.com/servers/index.jsp?gr0=cpu&amp;amp;fl0=cpu4&amp;amp;gr1=&lt;br /&gt;
&lt;br /&gt;
[10] http://www.sgi.com/pdfs/3867.pdf&lt;br /&gt;
&lt;br /&gt;
[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3430</id>
		<title>CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3430"/>
		<updated>2007-09-11T02:01:02Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* VLIW */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Architectural Trends ==&lt;br /&gt;
[[Image:MIPSR10000.jpg|thumb|right|300px|Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])]]&lt;br /&gt;
[[Image:IntelMoorsLaw.jpg|thumb|right|300px|Fig.2 The number of transistors on Intel chips]]&lt;br /&gt;
Feature size is the minimum size of the transistors, or the width of the wires connecting transistors and other circuit components. Feature sizes have decreased dramatically, from 10 microns in 1971 to 0.18 microns in 2001. These advanced integrated-circuit processes allow the integration of one billion transistors on a single chip and enable more complicated and faster microprocessor architectures, which have evolved in the direction of increasing parallelism: [http://en.wikipedia.org/wiki/Instruction_level_parallelism ILP] and [http://en.wikipedia.org/wiki/Thread_level_parallelism TLP]. As superscalar processors came to prevail, several additional exploitable architectures were proposed over the past 10 years, as in earlier decades. Building on the superscalar architecture, VLIW, superspeculation, simultaneous multithreading, chip multiprocessors, and other techniques were proposed and explored. These techniques try to overcome control and data hazards as pipelines deepen and issue widths grow, and to maximize computing throughput through TLP.&lt;br /&gt;
&lt;br /&gt;
For example, the MIPS R10000 is an out-of-order superscalar processor with 6.8 million transistors on a 16.64 mm x 17.934 mm (298 mm^2) die in a 0.35 um process. It fetches 4 instructions simultaneously and has 6 pipelines in total: 5 for execution and 1 for fetching and decoding. The execution pipelines fall into 3 categories: integer, floating-point, and load/store.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VLIW ===&lt;br /&gt;
VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple issue can be attained in two basic ways: superscalar and VLIW. The main difference between them lies in how instructions are scheduled. While superscalar processors issue multiple instructions per clock that are scheduled either statically or dynamically, VLIWs issue instructions scheduled statically by the compiler. Both superscalar and VLIW processors have multiple independent functional units.&lt;br /&gt;
&lt;br /&gt;
The VLIW compiler analyzes the program's instructions and groups multiple independent instructions into one large packaged instruction. A VLIW processor issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet that expresses the parallelism.&lt;br /&gt;
&lt;br /&gt;
To look inside VLIW operation, consider the example code below for MIPS [1].&lt;br /&gt;
&lt;br /&gt;
for (i=1000; i&amp;gt;0; i=i-1)   x[i] = x[i] + s;&lt;br /&gt;
&lt;br /&gt;
The standard MIPS code looks like this:&lt;br /&gt;
&lt;br /&gt;
[[Image:simpleMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If loop-unrolling and scheduling the code are applied, then&lt;br /&gt;
&lt;br /&gt;
[[Image:loopunrollMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The loop body takes 14 cycles.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If VLIW instructions are used, then&lt;br /&gt;
[[Image:VLIW.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The MIPS R10000 is also an example: it has 2 integer functional units and 3 kinds of operands, so the compiler can generate one instruction that contains 3 integer operations with the corresponding operands for each operation. Other examples of VLIW include the Intel i860 and the Philips TriMedia.&lt;br /&gt;
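As a sketch of the loop-unrolling idea shown above, the following C version (hypothetical function names; it assumes the trip count is a multiple of the unroll factor) shows the transformation a VLIW compiler relies on: unrolling removes most branch overhead and exposes independent operations that can be packed into the slots of one long instruction.

```c
/* Rolled form of the example loop: one add and one branch per iteration. */
void saxpy_rolled(double x[], double s, int n) {
    for (int i = n; i > 0; i = i - 1)
        x[i - 1] = x[i - 1] + s;
}

/* Unrolled by 4 (assumes n is a multiple of 4): one branch per 4 adds,
   and the 4 adds are independent of each other, exposing ILP that a
   VLIW scheduler can place into parallel functional-unit slots. */
void saxpy_unrolled(double x[], double s, int n) {
    for (int i = n; i > 0; i = i - 4) {
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
        x[i - 4] = x[i - 4] + s;
    }
}
```

Both functions compute the same result; the unrolled body simply gives the static scheduler more independent work per branch.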
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Multi-threading ===&lt;br /&gt;
[[Image:SMTEx.jpg|thumb|right|300px|Fig.3 Four different approaches of using issue slots in superscalar processor (Redrawn from Fig 6.44 of [1])]]&lt;br /&gt;
Multi-threading exploits thread-level parallelism (TLP) within a single processor. It allows multiple threads to share the functional units of a single processor in an overlapping manner. For this sharing, the processor must maintain duplicated state for each thread: register file, PC, page table, and so on. In addition, the processor must be able to switch between threads quickly.&lt;br /&gt;
&lt;br /&gt;
There are two basic approaches to multi-threading: fine-grained and coarse-grained. The former interleaves instructions from multiple threads, switching threads potentially on every clock cycle. Its advantage is that it hides stalls, because instructions from other threads can execute while one thread stalls. Its disadvantage is that it slows down an individual thread's execution: even when a thread's instruction is ready to execute, it may be delayed by another thread's instruction.&lt;br /&gt;
&lt;br /&gt;
The latter switches threads only on costly stalls. This policy reduces unnecessary thread switching, so an individual thread does not slow down as in the fine-grained case. However, each switch incurs the cost of refilling the pipeline: such a processor issues instructions from a single thread, so when a stall occurs and the running thread is switched, the pipeline is empty and must be filled before the new thread executes.&lt;br /&gt;
&lt;br /&gt;
Simultaneous multithreading (SMT) is a form of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP while, at the same time, exploiting ILP across the issue slots within a single clock cycle. Fig. 3 compares the three kinds of multi-threading with a superscalar processor.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-core ===&lt;br /&gt;
[[Image:Smithfield_die_med.jpg|thumb|right|80px|Fig.4 Intel® Pentium® processor Extreme Edition processor die [7]]]&lt;br /&gt;
Multi-core CPUs have multiple CPU cores on a single die, connected to each other through a shared L2 or L3 cache or through on-die glue logic such as a switch or bus. The cores on a die share the interconnect components used to interface with other processors and the rest of the system, such as an FSB (Front Side Bus), a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The advantages of multi-core chips are power efficiency and simplicity around the processors: since multiple processors are packed into a single die, the glue logic required to connect them is packed into the die as well, which saves power and simplifies the auxiliary circuits compared with discrete coupled processors, which need PCB circuits.&lt;br /&gt;
The Intel Pentium Extreme Edition, Core Duo, and Core 2 Duo are good examples of multi-core processors.&lt;br /&gt;
The Intel Xeon X7300 series is quad-core, fabricated in a 65 nm process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Speculative Execution ===&lt;br /&gt;
In the pursuit of more ILP, managing control dependences becomes both more important and more burdensome. To reduce the cost of branch stalls, a branch-prediction technique is applied at the instruction-fetch stage. However, for a processor that executes multiple instructions per clock, more than accurate prediction is required. To speculate is to act on these predictions: fetch and execute instructions from the predicted path.&lt;br /&gt;
&lt;br /&gt;
Under speculative execution, the processor fetches, issues, and executes instructions as if branch predictions were always correct, and a recovery mechanism handles mispredictions. When the processor encounters a branch, it predicts the branch target, takes a checkpoint, and follows the predicted path. The checkpoint duplicates state such as the register file, control information, and the alternative branch target. If the prediction is correct, the processor reclaims the checkpoint storage for use by newly predicted branches; if it is incorrect, the processor restores the execution state from the corresponding checkpoint. Examples include the PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, and AMD K5/K6/Athlon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Updated Figure 1.8 &amp;amp; 1.9 ==&lt;br /&gt;
[[Image:fig18.jpg|frame|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Figure 1.8 of our book has been updated to incorporate trends from 2000 to the present. The SGI Origin 3000 series was reintroduced as the Origin 3400 and Origin 3900 in 2000 and 2003, respectively. Sun introduced enterprise servers even more powerful than the E10000: the E15000 in 2002 and the E20000 and E25000 in 2006. HP's high-end 9000 Superdome, with 16, 32, and 64 processors, was released this year (2007).&lt;br /&gt;
[[Image:fig19.jpg|frame|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors(Y-axis is log-scaled)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Figure 1.9 shows the shared-memory bus bandwidth of the servers introduced in Figure 1.8: the SGI Origin 3000 series; the Sun Enterprise 15K, 20K, and 25K; the IBM p5 590; and the HP 9000 Superdome. For the Sun E25K, the available bandwidth is 43.2 GBps and the aggregate bandwidth exceeds 100 GBps. The Origin 3900 has 12.8 GBps of bandwidth and an aggregate bandwidth of 172.8 GBps.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] John L. Hennessy, David A. Patterson, &amp;quot;Computer Architecture: A Quantitative Approach&amp;quot; 3rd Ed., Morgan Kaufmann, CA, USA&lt;br /&gt;
&lt;br /&gt;
[2] CE Kozyrakis, DA Patterson, &amp;quot;A new direction for computer architecture research&amp;quot;, &lt;br /&gt;
Computer Volume 31 Issue 11, IEEE, Nov 1998, pp24-32&lt;br /&gt;
&lt;br /&gt;
[3] K.C. Yeager, &amp;quot;The MIPS R10000 Superscalar Microprocessor&amp;quot;, IEEE Micro Volume 16 Issue 2, Apr. 1996, pp28-41&lt;br /&gt;
&lt;br /&gt;
[4] Geoff Koch, &amp;quot;Discovering Multi-Core: Extending the Benefits of Moore’s Law&amp;quot;, Technology@Intel Magazine, Jul 2005, pp1-6&lt;br /&gt;
&lt;br /&gt;
[5] Richard Low, &amp;quot;Microprocessor trends: multicore, memory, and power developments&amp;quot;, Embedded Computing Design, Sep 2005&lt;br /&gt;
&lt;br /&gt;
[6] Artur Klauser, &amp;quot;Trends in High-Performance Microprocessor Design&amp;quot;, Telematik 1, 2001&lt;br /&gt;
&lt;br /&gt;
[7] http://www.intel.com &amp;amp; http://www.intel.com/pressroom/kits/pentiumee&lt;br /&gt;
&lt;br /&gt;
[8] http://www.alimartech.com/9000_servers.htm&lt;br /&gt;
&lt;br /&gt;
[9] http://www.sun.com/servers/index.jsp?gr0=cpu&amp;amp;fl0=cpu4&amp;amp;gr1=&lt;br /&gt;
&lt;br /&gt;
[10] http://www.sgi.com/pdfs/3867.pdf&lt;br /&gt;
&lt;br /&gt;
[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3429</id>
		<title>CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3429"/>
		<updated>2007-09-11T02:00:39Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* VLIW */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Architectural Trends ==&lt;br /&gt;
[[Image:MIPSR10000.jpg|thumb|right|300px|Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])]]&lt;br /&gt;
[[Image:IntelMoorsLaw.jpg|thumb|right|300px|Fig.2 The number of transistors on Intel chips]]&lt;br /&gt;
Feature size is the minimum size of the transistors, or the width of the wires connecting transistors and other circuit components. Feature sizes have decreased dramatically, from 10 microns in 1971 to 0.18 microns in 2001. These advanced integrated-circuit processes allow the integration of one billion transistors on a single chip and enable more complicated and faster microprocessor architectures, which have evolved in the direction of increasing parallelism: [http://en.wikipedia.org/wiki/Instruction_level_parallelism ILP] and [http://en.wikipedia.org/wiki/Thread_level_parallelism TLP]. As superscalar processors came to prevail, several additional exploitable architectures were proposed over the past 10 years, as in earlier decades. Building on the superscalar architecture, VLIW, superspeculation, simultaneous multithreading, chip multiprocessors, and other techniques were proposed and explored. These techniques try to overcome control and data hazards as pipelines deepen and issue widths grow, and to maximize computing throughput through TLP.&lt;br /&gt;
&lt;br /&gt;
For example, the MIPS R10000 is an out-of-order superscalar processor with 6.8 million transistors on a 16.64 mm x 17.934 mm (298 mm^2) die in a 0.35 um process. It fetches 4 instructions simultaneously and has 6 pipelines in total: 5 for execution and 1 for fetching and decoding. The execution pipelines fall into 3 categories: integer, floating-point, and load/store.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VLIW ===&lt;br /&gt;
VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple issue can be attained in two basic ways: superscalar and VLIW. The main difference between them lies in how instructions are scheduled. While superscalar processors issue multiple instructions per clock that are scheduled either statically or dynamically, VLIWs issue instructions scheduled statically by the compiler. Both superscalar and VLIW processors have multiple independent functional units.&lt;br /&gt;
&lt;br /&gt;
The VLIW compiler analyzes the program's instructions and groups multiple independent instructions into one large packaged instruction. A VLIW processor issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet that expresses the parallelism.&lt;br /&gt;
&lt;br /&gt;
To look inside VLIW operation, consider the example code below for MIPS [1].&lt;br /&gt;
&lt;br /&gt;
for (i=1000; i&amp;gt;0; i=i-1)   x[i] = x[i] + s;&lt;br /&gt;
&lt;br /&gt;
The standard MIPS code looks like this:&lt;br /&gt;
&lt;br /&gt;
[[Image:simpleMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If loop-unrolling and scheduling the code are applied, then&lt;br /&gt;
&lt;br /&gt;
[[Image:loopunrollMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
The loop body takes 14 cycles.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If VLIW instructions are used, then&lt;br /&gt;
[[Image:VLIW.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The MIPS R10000 is also an example: it has 2 integer functional units and 3 kinds of operands, so the compiler can generate one instruction that contains 3 integer operations with the corresponding operands for each operation. Other examples of VLIW include the Intel i860 and the Philips TriMedia.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Multi-threading ===&lt;br /&gt;
[[Image:SMTEx.jpg|thumb|right|300px|Fig.3 Four different approaches of using issue slots in superscalar processor (Redrawn from Fig 6.44 of [1])]]&lt;br /&gt;
Multi-threading exploits thread-level parallelism (TLP) within a single processor. It allows multiple threads to share the functional units of a single processor in an overlapping manner. For this sharing, the processor must maintain duplicated state for each thread: register file, PC, page table, and so on. In addition, the processor must be able to switch between threads quickly.&lt;br /&gt;
&lt;br /&gt;
There are two basic approaches to multi-threading: fine-grained and coarse-grained. The former interleaves instructions from multiple threads, switching threads potentially on every clock cycle. Its advantage is that it hides stalls, because instructions from other threads can execute while one thread stalls. Its disadvantage is that it slows down an individual thread's execution: even when a thread's instruction is ready to execute, it may be delayed by another thread's instruction.&lt;br /&gt;
&lt;br /&gt;
The latter switches threads only on costly stalls. This policy reduces unnecessary thread switching, so an individual thread does not slow down as in the fine-grained case. However, each switch incurs the cost of refilling the pipeline: such a processor issues instructions from a single thread, so when a stall occurs and the running thread is switched, the pipeline is empty and must be filled before the new thread executes.&lt;br /&gt;
&lt;br /&gt;
Simultaneous multithreading (SMT) is a form of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP while, at the same time, exploiting ILP across the issue slots within a single clock cycle. Fig. 3 compares the three kinds of multi-threading with a superscalar processor.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-core ===&lt;br /&gt;
[[Image:Smithfield_die_med.jpg|thumb|right|80px|Fig.4 Intel® Pentium® processor Extreme Edition processor die [7]]]&lt;br /&gt;
Multi-core CPUs have multiple CPU cores on a single die, connected to each other through a shared L2 or L3 cache or through on-die glue logic such as a switch or bus. The cores on a die share the interconnect components used to interface with other processors and the rest of the system, such as an FSB (Front Side Bus), a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The advantages of multi-core chips are power efficiency and simplicity around the processors: since multiple processors are packed into a single die, the glue logic required to connect them is packed into the die as well, which saves power and simplifies the auxiliary circuits compared with discrete coupled processors, which need PCB circuits.&lt;br /&gt;
The Intel Pentium Extreme Edition, Core Duo, and Core 2 Duo are good examples of multi-core processors.&lt;br /&gt;
The Intel Xeon X7300 series is quad-core, fabricated in a 65 nm process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Speculative Execution ===&lt;br /&gt;
In the pursuit of more ILP, managing control dependences becomes both more important and more burdensome. To reduce the cost of branch stalls, a branch-prediction technique is applied at the instruction-fetch stage. However, for a processor that executes multiple instructions per clock, more than accurate prediction is required. To speculate is to act on these predictions: fetch and execute instructions from the predicted path.&lt;br /&gt;
&lt;br /&gt;
Under speculative execution, the processor fetches, issues, and executes instructions as if branch predictions were always correct, and a recovery mechanism handles mispredictions. When the processor encounters a branch, it predicts the branch target, takes a checkpoint, and follows the predicted path. The checkpoint duplicates state such as the register file, control information, and the alternative branch target. If the prediction is correct, the processor reclaims the checkpoint storage for use by newly predicted branches; if it is incorrect, the processor restores the execution state from the corresponding checkpoint. Examples include the PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, and AMD K5/K6/Athlon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Updated Figure 1.8 &amp;amp; 1.9 ==&lt;br /&gt;
[[Image:fig18.jpg|frame|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Figure 1.8 of our book has been updated to incorporate trends from 2000 to the present. The SGI Origin 3000 series was reintroduced as the Origin 3400 and Origin 3900 in 2000 and 2003, respectively. Sun introduced enterprise servers even more powerful than the E10000: the E15000 in 2002 and the E20000 and E25000 in 2006. HP's high-end 9000 Superdome, with 16, 32, and 64 processors, was released this year (2007).&lt;br /&gt;
[[Image:fig19.jpg|frame|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors(Y-axis is log-scaled)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Figure 1.9 shows the shared-memory bus bandwidth of the servers introduced in Figure 1.8: the SGI Origin 3000 series; the Sun Enterprise 15K, 20K, and 25K; the IBM p5 590; and the HP 9000 Superdome. For the Sun E25K, the available bandwidth is 43.2 GBps and the aggregate bandwidth exceeds 100 GBps. The Origin 3900 has 12.8 GBps of bandwidth and an aggregate bandwidth of 172.8 GBps.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] John L. Hennessy, David A. Patterson, &amp;quot;Computer Architecture: A Quantitative Approach&amp;quot; 3rd Ed., Morgan Kaufmann, CA, USA&lt;br /&gt;
&lt;br /&gt;
[2] CE Kozyrakis, DA Patterson, &amp;quot;A new direction for computer architecture research&amp;quot;, &lt;br /&gt;
Computer Volume 31 Issue 11, IEEE, Nov 1998, pp24-32&lt;br /&gt;
&lt;br /&gt;
[3] K.C. Yeager, &amp;quot;The MIPS R10000 Superscalar Microprocessor&amp;quot;, IEEE Micro Volume 16 Issue 2, Apr. 1996, pp28-41&lt;br /&gt;
&lt;br /&gt;
[4] Geoff Koch, &amp;quot;Discovering Multi-Core: Extending the Benefits of Moore’s Law&amp;quot;, Technology@Intel Magazine, Jul 2005, pp1-6&lt;br /&gt;
&lt;br /&gt;
[5] Richard Low, &amp;quot;Microprocessor trends: multicore, memory, and power developments&amp;quot;, Embedded Computing Design, Sep 2005&lt;br /&gt;
&lt;br /&gt;
[6] Artur Klauser, &amp;quot;Trends in High-Performance Microprocessor Design&amp;quot;, Telematik 1, 2001&lt;br /&gt;
&lt;br /&gt;
[7] http://www.intel.com &amp;amp; http://www.intel.com/pressroom/kits/pentiumee&lt;br /&gt;
&lt;br /&gt;
[8] http://www.alimartech.com/9000_servers.htm&lt;br /&gt;
&lt;br /&gt;
[9] http://www.sun.com/servers/index.jsp?gr0=cpu&amp;amp;fl0=cpu4&amp;amp;gr1=&lt;br /&gt;
&lt;br /&gt;
[10] http://www.sgi.com/pdfs/3867.pdf&lt;br /&gt;
&lt;br /&gt;
[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3428</id>
		<title>CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3428"/>
		<updated>2007-09-11T02:00:02Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* VLIW */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Architectural Trends ==&lt;br /&gt;
[[Image:MIPSR10000.jpg|thumb|right|300px|Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])]]&lt;br /&gt;
[[Image:IntelMoorsLaw.jpg|thumb|right|300px|Fig.2 The number of transistors on Intel chips]]&lt;br /&gt;
Feature size is the minimum size of the transistors, or the width of the wires connecting transistors and other circuit components. Feature sizes have decreased dramatically, from 10 microns in 1971 to 0.18 microns in 2001. These advanced integrated-circuit processes allow the integration of one billion transistors on a single chip and enable more complicated and faster microprocessor architectures, which have evolved in the direction of increasing parallelism: [http://en.wikipedia.org/wiki/Instruction_level_parallelism ILP] and [http://en.wikipedia.org/wiki/Thread_level_parallelism TLP]. As superscalar processors came to prevail, several additional exploitable architectures were proposed over the past 10 years, as in earlier decades. Building on the superscalar architecture, VLIW, superspeculation, simultaneous multithreading, chip multiprocessors, and other techniques were proposed and explored. These techniques try to overcome control and data hazards as pipelines deepen and issue widths grow, and to maximize computing throughput through TLP.&lt;br /&gt;
&lt;br /&gt;
For example, the MIPS R10000 is an out-of-order superscalar processor with 6.8 million transistors on a 16.64 mm x 17.934 mm (298 mm^2) die in a 0.35 um process. It fetches 4 instructions simultaneously and has 6 pipelines in total: 5 for execution and 1 for fetching and decoding. The execution pipelines fall into 3 categories: integer, floating-point, and load/store.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VLIW ===&lt;br /&gt;
VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple issue can be attained in two basic ways - superscalar and VLIW. The big difference between the two lies in how instructions are scheduled. While superscalar processors issue multiple instructions per clock, scheduled either statically or dynamically, VLIWs issue instructions statically scheduled by the compiler. Both superscalar and VLIW processors have multiple, independent functional units.&lt;br /&gt;
&lt;br /&gt;
A VLIW compiler analyzes the programmer's instructions and groups multiple independent instructions into one large packaged instruction. A VLIW processor issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet that makes the parallelism explicit.&lt;br /&gt;
&lt;br /&gt;
To look inside VLIW operation, consider the following example code for MIPS [1].&lt;br /&gt;
&lt;br /&gt;
for (i=1000; i&amp;gt;0; i=i-1)   x[i] = x[i] + s;&lt;br /&gt;
&lt;br /&gt;
The standard MIPS code looks like this:&lt;br /&gt;
&lt;br /&gt;
[[Image:simpleMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If loop-unrolling and scheduling the code are applied, then&lt;br /&gt;
&lt;br /&gt;
[[Image:loopunrollMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
The loop body takes 14 cycles.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If VLIW instructions are used, then&lt;br /&gt;
[[Image:VLIW.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The MIPS R10000 is also an example: it has 2 integer functional units and 3 kinds of operands, so the compiler can generate one instruction that contains 3 integer operations with the corresponding operands for each operation. Other examples of VLIW designs are the Intel i860 and the Philips TriMedia.&lt;br /&gt;
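The loop-unrolling transformation discussed above can be sketched in plain code. This is an illustrative sketch, not the book's MIPS assembly: the function names are ours, and the unrolled version assumes the trip count is divisible by 4, as in the 1000-element example.

```python
def add_scalar(x, s):
    # Straightforward form of the example loop: x[i] = x[i] + s
    for i in range(len(x)):
        x[i] = x[i] + s

def add_scalar_unrolled(x, s):
    # Unrolled by 4: the four adds are independent of one another,
    # so a VLIW compiler could pack them into one wide instruction.
    # Assumes len(x) is divisible by 4.
    for i in range(0, len(x), 4):
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
```

Unrolling exposes independent operations per iteration; it is the scheduling of those operations into wide instruction words that distinguishes a VLIW compiler.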
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Multi-threading ===&lt;br /&gt;
[[Image:SMTEx.jpg|thumb|right|300px|Fig.3 Four different approaches of using issue slots in superscalar processor (Redrawn from Fig 6.44 of [1])]]&lt;br /&gt;
Multi-threading exploits thread-level parallelism (TLP) within a single processor. It allows multiple threads to share the functional units of a single processor in an overlapping manner. For this sharing, the processor has to maintain duplicated state for each thread - register file, PC, page table, and so on. In addition, the processor must be able to switch between threads quickly.&lt;br /&gt;
&lt;br /&gt;
There are two basic approaches to multi-threading: fine-grained and coarse-grained. The former switches between multiple interleaved threads on each instruction, so the processor can switch threads on every clock cycle. The advantage of this architecture is that it hides stalls: when one thread stalls, instructions from other threads can still be executed. The disadvantage is that it slows down each individual thread, because even when a thread's instruction is ready to execute, it may be interleaved with another thread's instruction.&lt;br /&gt;
&lt;br /&gt;
The latter switches threads only on a costly stall. This policy avoids unnecessary thread switches, so an individual thread is not slowed down as in the fine-grained case. However, a switch has a cost: such a processor issues instructions from a single thread at a time, so when a stall occurs the pipeline empties, and executing a new thread in place of the stalled one requires refilling the pipeline.&lt;br /&gt;
&lt;br /&gt;
Simultaneous multithreading (SMT) is a kind of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP while at the same time exploiting ILP across the issue slots of a single clock cycle. Fig. 3 compares the three kinds of multi-threading alongside a plain superscalar processor.&lt;br /&gt;
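At the software level, the thread-level parallelism these designs exploit looks like ordinary threaded code. Below is a minimal sketch; the function name and the default thread count are ours, not from the text.

```python
import threading

def parallel_sum(data, nthreads=4):
    # Each thread sums a disjoint slice of the data. A multithreaded
    # core can overlap such threads on its shared functional units,
    # hiding one thread's stalls with another thread's work.
    n = len(data)
    partial = [0.0] * nthreads

    def sum_slice(t):
        lo = t * n // nthreads
        hi = (t + 1) * n // nthreads
        partial[t] = sum(data[lo:hi])

    threads = [threading.Thread(target=sum_slice, args=(t,))
               for t in range(nthreads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return sum(partial)
```

Whether these threads actually run overlapped on one core (fine-grained, coarse-grained, or SMT) or on separate cores is decided by the hardware and OS; the software expression of TLP is the same.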
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-core ===&lt;br /&gt;
[[Image:Smithfield_die_med.jpg|thumb|right|80px|Fig.4 Intel® Pentium® processor Extreme Edition processor die [7]]]&lt;br /&gt;
Multi-core CPUs have multiple CPU cores on a single die, connected to each other through a shared L2 or L3 cache, or through on-die glue logic such as a switch or bus. Every CPU core on a die shares the interconnect components used to interface with other processors and the rest of the system. These components include an FSB (Front Side Bus), a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The advantages of multi-core chips are power efficiency and simplicity around the processors: since multiple processors are packed into a single die, the glue logic required to connect them is packed into the die as well. This saves power and simplifies the auxiliary circuits compared with discretely coupled processors, which need PCB-level circuits.&lt;br /&gt;
The Intel Pentium Extreme Edition, Core Duo, and Core 2 Duo are good examples of multi-core processors.&lt;br /&gt;
The Intel Xeon X7300 series is quad-core, built with a 65 nm process.&lt;br /&gt;
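As a small practical aside, software can query how many logical processors a multi-core system exposes. A sketch using the standard library (the wrapper function name is ours):

```python
import os

def online_cpus():
    # Number of logical processors the OS reports. On a multi-core
    # chip, each core (and each hardware thread, if present) appears
    # as one logical processor.
    return os.cpu_count()
```

On a dual-core machine without hardware threads this returns 2; with SMT enabled, each core contributes more than one logical processor.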
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Speculative Execution ===&lt;br /&gt;
In the quest for more ILP, managing control dependences becomes more important and more of a burden. To reduce the cost of branch stalls, branch prediction is applied in the instruction-fetch stage. However, a processor that executes multiple instructions per clock requires more than accurate prediction. To speculate is to act on these predictions: to fetch and execute instructions from the predicted path.&lt;br /&gt;
&lt;br /&gt;
Under speculative execution, instructions are fetched, issued, and executed as if branch predictions were always correct; a recovery mechanism handles mispredictions. When the processor meets a branch, it predicts the branch target, follows that path, and takes a checkpoint. While checkpointing, the processor duplicates state such as register files, control information, and the alternative branch target. If the prediction is correct, the processor reclaims the stored information for reuse by newly predicted branches; if it is incorrect, it restores the execution state from the corresponding checkpoint. Examples include the PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, and AMD K5/K6/Athlon.&lt;br /&gt;
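The branch prediction that speculation builds on is commonly implemented with 2-bit saturating counters. A minimal sketch of that state machine follows; the class is ours for illustration, not a description of any specific processor above.

```python
class TwoBitPredictor:
    """Classic 2-bit saturating counter: states 0 and 1 predict
    not-taken, states 2 and 3 predict taken. Two consecutive
    mispredictions are needed to flip a strongly held prediction."""

    def __init__(self):
        self.state = 0  # start in strongly not-taken

    def predict(self):
        # True means "predict taken"
        return self.state >= 2

    def update(self, taken):
        # Saturate the counter toward the observed outcome.
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)
```

The hysteresis is the point: a loop-closing branch that is taken hundreds of times and then falls through once is mispredicted only once, not twice.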
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Updated Figure 1.8 &amp;amp; 1.9 ==&lt;br /&gt;
[[Image:fig18.jpg|frame|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.8 of our book has been updated to incorporate trends from 2000 to the present. The SGI Origin 3000 series was reintroduced as the Origin 3400 and Origin 3900 in 2000 and 2003, respectively. Sun introduced enterprise servers even more powerful than the E10000: the E15000 in 2002, and the E20000 and E25000 in 2006. HP's high-end 9000 Superdome servers with 16, 32, and 64 processors were released this year (2007).&lt;br /&gt;
[[Image:fig19.jpg|frame|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors(Y-axis is log-scaled)]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.9 shows the shared-memory bus bandwidth of the servers introduced in Figure 1.8: the SGI Origin 3000 series, the Sun Enterprise 15K, 20K, and 25K, the IBM p5 590, and the HP 9000 Superdome. For the Sun E25K, the available bandwidth is 43.2 GBps and the aggregate bandwidth exceeds 100 GBps. The Origin 3900 has 12.8 GBps of bandwidth and an aggregate bandwidth of 172.8 GBps.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] John L. Hennessy, David A. Patterson, &amp;quot;Computer Architecture: A Quantitative Approach&amp;quot; 3rd Ed., Morgan Kaufmann, CA, USA&lt;br /&gt;
&lt;br /&gt;
[2] CE Kozyrakis, DA Patterson, &amp;quot;A new direction for computer architecture research&amp;quot;, &lt;br /&gt;
Computer Volume 31 Issue 11, IEEE, Nov 1998, pp24-32&lt;br /&gt;
&lt;br /&gt;
[3] K.C. Yeager, &amp;quot;The MIPS R10000 Superscalar Microprocessor&amp;quot;, IEEE Micro Volume 16 Issue 2, Apr. 1996, pp28-41&lt;br /&gt;
&lt;br /&gt;
[4] Geoff Koch, &amp;quot;Discovering Multi-Core: Extending the Benefits of Moore’s Law&amp;quot;, Technology@Intel Magazine, Jul 2005, pp1-6&lt;br /&gt;
&lt;br /&gt;
[5] Richard Low, &amp;quot;Microprocessor trends:multicore, memory, and power developments&amp;quot;, Embedded Computing Design, Sep 2005&lt;br /&gt;
&lt;br /&gt;
[6] Artur Klauser, &amp;quot;Trends in High-Performance Microprocessor Design&amp;quot;, Telematik 1, 2001&lt;br /&gt;
&lt;br /&gt;
[7] http://www.intel.com &amp;amp; http://www.intel.com/pressroom/kits/pentiumee&lt;br /&gt;
&lt;br /&gt;
[8] http://www.alimartech.com/9000_servers.htm&lt;br /&gt;
&lt;br /&gt;
[9] http://www.sun.com/servers/index.jsp?gr0=cpu&amp;amp;fl0=cpu4&amp;amp;gr1=&lt;br /&gt;
&lt;br /&gt;
[10] http://www.sgi.com/pdfs/3867.pdf&lt;br /&gt;
&lt;br /&gt;
[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3427</id>
		<title>CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3427"/>
		<updated>2007-09-11T01:59:16Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* VLIW */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Architectural Trends ==&lt;br /&gt;
[[Image:MIPSR10000.jpg|thumb|right|300px|Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])]]&lt;br /&gt;
[[Image:IntelMoorsLaw.jpg|thumb|right|300px|Fig.2 The number of transistors on Intel chips]]&lt;br /&gt;
Feature size refers to the minimum size of a transistor, or the width of the wires used to connect transistors and other circuit components. Feature sizes decreased dramatically from 10 microns in 1971 to 0.18 microns in 2001. These advances in integrated-circuit processes allowed one billion transistors to be integrated on a single chip and enabled more complex and faster microprocessor architectures, which have evolved in the direction of increasing parallelism: [http://en.wikipedia.org/wiki/Instruction_level_parallelism ILP] and [http://en.wikipedia.org/wiki/Thread_level_parallelism TLP]. As the superscalar processor came to prevail, several additional exploitable architectures were proposed over the past ten years, just as in previous decades. Building on the superscalar architecture, VLIW, superspeculative, simultaneous multithreading, chip multiprocessor, and other designs were proposed and explored. These techniques try to overcome the control and data hazards that deep pipelining and multiple issue aggravate, as well as to maximize computing throughput through TLP.&lt;br /&gt;
&lt;br /&gt;
For example, the MIPS R10000 is a superscalar processor with out-of-order execution; it has 6.8 million transistors on a 16.64 mm x 17.934 mm (298 mm^2) die built in a 0.35 um process. It fetches 4 instructions simultaneously and has 6 pipelines in total: 5 pipelines for execution and 1 pipeline for fetching and decoding. The execution pipelines fall into 3 categories - integer, floating-point, and load/store.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VLIW ===&lt;br /&gt;
VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple issue can be attained in two basic ways - superscalar and VLIW. The big difference between the two lies in how instructions are scheduled. While superscalar processors issue multiple instructions per clock, scheduled either statically or dynamically, VLIWs issue instructions statically scheduled by the compiler. Both superscalar and VLIW processors have multiple, independent functional units.&lt;br /&gt;
&lt;br /&gt;
A VLIW compiler analyzes the programmer's instructions and groups multiple independent instructions into one large packaged instruction. A VLIW processor issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet that makes the parallelism explicit.&lt;br /&gt;
&lt;br /&gt;
To look inside VLIW operation, consider the following example code for MIPS [1].&lt;br /&gt;
&lt;br /&gt;
for (i=1000; i&amp;gt;0; i=i-1)   x[i] = x[i] + s;&lt;br /&gt;
&lt;br /&gt;
The standard MIPS code looks like this:&lt;br /&gt;
[[Image:simpleMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If loop-unrolling and scheduling the code are applied, then&lt;br /&gt;
[[Image:loopunrollMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The loop body takes 14 cycles.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If VLIW instructions are used, then&lt;br /&gt;
[[Image:VLIW.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The MIPS R10000 is also an example: it has 2 integer functional units and 3 kinds of operands, so the compiler can generate one instruction that contains 3 integer operations with the corresponding operands for each operation. Other examples of VLIW designs are the Intel i860 and the Philips TriMedia.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Multi-threading ===&lt;br /&gt;
[[Image:SMTEx.jpg|thumb|right|300px|Fig.3 Four different approaches of using issue slots in superscalar processor (Redrawn from Fig 6.44 of [1])]]&lt;br /&gt;
Multi-threading exploits thread-level parallelism (TLP) within a single processor. It allows multiple threads to share the functional units of a single processor in an overlapping manner. For this sharing, the processor has to maintain duplicated state for each thread - register file, PC, page table, and so on. In addition, the processor must be able to switch between threads quickly.&lt;br /&gt;
&lt;br /&gt;
There are two basic approaches to multi-threading: fine-grained and coarse-grained. The former switches between multiple interleaved threads on each instruction, so the processor can switch threads on every clock cycle. The advantage of this architecture is that it hides stalls: when one thread stalls, instructions from other threads can still be executed. The disadvantage is that it slows down each individual thread, because even when a thread's instruction is ready to execute, it may be interleaved with another thread's instruction.&lt;br /&gt;
&lt;br /&gt;
The latter switches threads only on a costly stall. This policy avoids unnecessary thread switches, so an individual thread is not slowed down as in the fine-grained case. However, a switch has a cost: such a processor issues instructions from a single thread at a time, so when a stall occurs the pipeline empties, and executing a new thread in place of the stalled one requires refilling the pipeline.&lt;br /&gt;
&lt;br /&gt;
Simultaneous multithreading (SMT) is a kind of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP while at the same time exploiting ILP across the issue slots of a single clock cycle. Fig. 3 compares the three kinds of multi-threading alongside a plain superscalar processor.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-core ===&lt;br /&gt;
[[Image:Smithfield_die_med.jpg|thumb|right|80px|Fig.4 Intel® Pentium® processor Extreme Edition processor die [7]]]&lt;br /&gt;
Multi-core CPUs have multiple CPU cores on a single die, connected to each other through a shared L2 or L3 cache, or through on-die glue logic such as a switch or bus. Every CPU core on a die shares the interconnect components used to interface with other processors and the rest of the system. These components include an FSB (Front Side Bus), a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The advantages of multi-core chips are power efficiency and simplicity around the processors: since multiple processors are packed into a single die, the glue logic required to connect them is packed into the die as well. This saves power and simplifies the auxiliary circuits compared with discretely coupled processors, which need PCB-level circuits.&lt;br /&gt;
The Intel Pentium Extreme Edition, Core Duo, and Core 2 Duo are good examples of multi-core processors.&lt;br /&gt;
The Intel Xeon X7300 series is quad-core, built with a 65 nm process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Speculative Execution ===&lt;br /&gt;
In the quest for more ILP, managing control dependences becomes more important and more of a burden. To reduce the cost of branch stalls, branch prediction is applied in the instruction-fetch stage. However, a processor that executes multiple instructions per clock requires more than accurate prediction. To speculate is to act on these predictions: to fetch and execute instructions from the predicted path.&lt;br /&gt;
&lt;br /&gt;
Under speculative execution, instructions are fetched, issued, and executed as if branch predictions were always correct; a recovery mechanism handles mispredictions. When the processor meets a branch, it predicts the branch target, follows that path, and takes a checkpoint. While checkpointing, the processor duplicates state such as register files, control information, and the alternative branch target. If the prediction is correct, the processor reclaims the stored information for reuse by newly predicted branches; if it is incorrect, it restores the execution state from the corresponding checkpoint. Examples include the PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, and AMD K5/K6/Athlon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Updated Figure 1.8 &amp;amp; 1.9 ==&lt;br /&gt;
[[Image:fig18.jpg|frame|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.8 of our book has been updated to incorporate trends from 2000 to the present. The SGI Origin 3000 series was reintroduced as the Origin 3400 and Origin 3900 in 2000 and 2003, respectively. Sun introduced enterprise servers even more powerful than the E10000: the E15000 in 2002, and the E20000 and E25000 in 2006. HP's high-end 9000 Superdome servers with 16, 32, and 64 processors were released this year (2007).&lt;br /&gt;
[[Image:fig19.jpg|frame|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors(Y-axis is log-scaled)]]&lt;br /&gt;
&lt;br /&gt;
Figure 1.9 shows the shared-memory bus bandwidth of the servers introduced in Figure 1.8: the SGI Origin 3000 series, the Sun Enterprise 15K, 20K, and 25K, the IBM p5 590, and the HP 9000 Superdome. For the Sun E25K, the available bandwidth is 43.2 GBps and the aggregate bandwidth exceeds 100 GBps. The Origin 3900 has 12.8 GBps of bandwidth and an aggregate bandwidth of 172.8 GBps.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] John L. Hennessy, David A. Patterson, &amp;quot;Computer Architecture: A Quantitative Approach&amp;quot; 3rd Ed., Morgan Kaufmann, CA, USA&lt;br /&gt;
&lt;br /&gt;
[2] CE Kozyrakis, DA Patterson, &amp;quot;A new direction for computer architecture research&amp;quot;, &lt;br /&gt;
Computer Volume 31 Issue 11, IEEE, Nov 1998, pp24-32&lt;br /&gt;
&lt;br /&gt;
[3] K.C. Yeager, &amp;quot;The MIPS R10000 Superscalar Microprocessor&amp;quot;, IEEE Micro Volume 16 Issue 2, Apr. 1996, pp28-41&lt;br /&gt;
&lt;br /&gt;
[4] Geoff Koch, &amp;quot;Discovering Multi-Core: Extending the Benefits of Moore’s Law&amp;quot;, Technology@Intel Magazine, Jul 2005, pp1-6&lt;br /&gt;
&lt;br /&gt;
[5] Richard Low, &amp;quot;Microprocessor trends:multicore, memory, and power developments&amp;quot;, Embedded Computing Design, Sep 2005&lt;br /&gt;
&lt;br /&gt;
[6] Artur Klauser, &amp;quot;Trends in High-Performance Microprocessor Design&amp;quot;, Telematik 1, 2001&lt;br /&gt;
&lt;br /&gt;
[7] http://www.intel.com &amp;amp; http://www.intel.com/pressroom/kits/pentiumee&lt;br /&gt;
&lt;br /&gt;
[8] http://www.alimartech.com/9000_servers.htm&lt;br /&gt;
&lt;br /&gt;
[9] http://www.sun.com/servers/index.jsp?gr0=cpu&amp;amp;fl0=cpu4&amp;amp;gr1=&lt;br /&gt;
&lt;br /&gt;
[10] http://www.sgi.com/pdfs/3867.pdf&lt;br /&gt;
&lt;br /&gt;
[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3426</id>
		<title>CSC/ECE 506 Fall 2007/wiki1 4 a1</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki1_4_a1&amp;diff=3426"/>
		<updated>2007-09-11T01:58:53Z</updated>

		<summary type="html">&lt;p&gt;Sykang: /* VLIW */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Architectural Trends ==&lt;br /&gt;
[[Image:MIPSR10000.jpg|thumb|right|300px|Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])]]&lt;br /&gt;
[[Image:IntelMoorsLaw.jpg|thumb|right|300px|Fig.2 The number of transistors on Intel chips]]&lt;br /&gt;
Feature size refers to the minimum size of a transistor, or the width of the wires used to connect transistors and other circuit components. Feature sizes decreased dramatically from 10 microns in 1971 to 0.18 microns in 2001. These advances in integrated-circuit processes allowed one billion transistors to be integrated on a single chip and enabled more complex and faster microprocessor architectures, which have evolved in the direction of increasing parallelism: [http://en.wikipedia.org/wiki/Instruction_level_parallelism ILP] and [http://en.wikipedia.org/wiki/Thread_level_parallelism TLP]. As the superscalar processor came to prevail, several additional exploitable architectures were proposed over the past ten years, just as in previous decades. Building on the superscalar architecture, VLIW, superspeculative, simultaneous multithreading, chip multiprocessor, and other designs were proposed and explored. These techniques try to overcome the control and data hazards that deep pipelining and multiple issue aggravate, as well as to maximize computing throughput through TLP.&lt;br /&gt;
&lt;br /&gt;
For example, the MIPS R10000 is a superscalar processor with out-of-order execution; it has 6.8 million transistors on a 16.64 mm x 17.934 mm (298 mm^2) die built in a 0.35 um process. It fetches 4 instructions simultaneously and has 6 pipelines in total: 5 pipelines for execution and 1 pipeline for fetching and decoding. The execution pipelines fall into 3 categories - integer, floating-point, and load/store.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== VLIW ===&lt;br /&gt;
VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple issue can be attained in two basic ways - superscalar and VLIW. The big difference between the two lies in how instructions are scheduled. While superscalar processors issue multiple instructions per clock, scheduled either statically or dynamically, VLIWs issue instructions statically scheduled by the compiler. Both superscalar and VLIW processors have multiple, independent functional units.&lt;br /&gt;
&lt;br /&gt;
A VLIW compiler analyzes the programmer's instructions and groups multiple independent instructions into one large packaged instruction. A VLIW processor issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet that makes the parallelism explicit.&lt;br /&gt;
&lt;br /&gt;
To look inside VLIW operation, consider the following example code for MIPS [1].&lt;br /&gt;
&lt;br /&gt;
for (i=1000; i&amp;gt;0; i=i-1)   x[i] = x[i] + s;&lt;br /&gt;
&lt;br /&gt;
The standard MIPS code looks like this:&lt;br /&gt;
[[Image:simpleMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
If loop-unrolling and scheduling the code are applied, then&lt;br /&gt;
[[Image:loopunrollMIPS.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
The loop body takes 14 cycles.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If VLIW instructions are used, then&lt;br /&gt;
[[Image:VLIW.jpg|left]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The MIPS R10000 is also an example: it has 2 integer functional units and 3 kinds of operands, so the compiler can generate one instruction that contains 3 integer operations with the corresponding operands for each operation. Other examples of VLIW designs are the Intel i860 and the Philips TriMedia.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Multi-threading ===&lt;br /&gt;
[[Image:SMTEx.jpg|thumb|right|300px|Fig.3 Four different approaches of using issue slots in superscalar processor (Redrawn from Fig 6.44 of [1])]]&lt;br /&gt;
Multi-threading exploits thread-level parallelism (TLP) within a single processor. It allows multiple threads to share the functional units of a single processor in an overlapping manner. For this sharing, the processor has to maintain duplicated state for each thread - register file, PC, page table, and so on. In addition, the processor must be able to switch between threads quickly.&lt;br /&gt;
&lt;br /&gt;
There are two basic approaches to multi-threading: fine-grained and coarse-grained. The former switches between multiple interleaved threads on each instruction, so the processor can switch threads on every clock cycle. The advantage of this architecture is that it hides stalls: when one thread stalls, instructions from other threads can still be executed. The disadvantage is that it slows down each individual thread, because even when a thread's instruction is ready to execute, it may be interleaved with another thread's instruction.&lt;br /&gt;
&lt;br /&gt;
The latter switches threads only on a costly stall. This policy avoids unnecessary thread switches, so an individual thread is not slowed down as in the fine-grained case. However, a switch has a cost: such a processor issues instructions from a single thread at a time, so when a stall occurs the pipeline empties, and executing a new thread in place of the stalled one requires refilling the pipeline.&lt;br /&gt;
&lt;br /&gt;
Simultaneous multithreading (SMT) is a kind of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP while at the same time exploiting ILP across the issue slots of a single clock cycle. Fig. 3 compares the three kinds of multi-threading alongside a plain superscalar processor.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Multi-core ===&lt;br /&gt;
[[Image:Smithfield_die_med.jpg|thumb|right|80px|Fig.4 Intel® Pentium® processor Extreme Edition processor die [7]]]&lt;br /&gt;
Multi-core CPUs have multiple CPU cores on a single die, connected to each other through a shared L2 or L3 cache, or through on-die glue logic such as a switch or bus. Every CPU core on a die shares the interconnect components used to interface with other processors and the rest of the system. These components include an FSB (Front Side Bus), a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The advantages of multi-core chips are power efficiency and simplicity around the processors: since multiple processors are packed into a single die, the glue logic required to connect them is packed into the die as well. This saves power and simplifies the auxiliary circuits compared with discretely coupled processors, which need PCB-level circuits.&lt;br /&gt;
The Intel Pentium Extreme Edition, Core Duo, and Core 2 Duo are good examples of multi-core processors.&lt;br /&gt;
The Intel Xeon X7300 series is quad-core, built with a 65 nm process.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Speculative Execution ===&lt;br /&gt;
In the effort to extract more ILP, managing control dependences becomes more important and more burdensome. To reduce the cost of stalls caused by branches, branch prediction techniques are applied at the instruction fetch stage. However, a processor that executes multiple instructions per clock requires more than accurate prediction: to speculate is to act on these predictions, fetching and executing instructions from the predicted path.[]&lt;br /&gt;
&lt;br /&gt;
Under speculative execution, instructions are fetched, issued, and executed as if branch predictions were always correct; a recovery mechanism handles mispredictions. When the processor encounters a branch, it predicts the branch target, follows that path, and takes a checkpoint. Checkpointing saves a copy of execution state such as the register file, control information, and the alternative branch target. If the prediction turns out to be correct, the processor reclaims the stored information for use by later predicted branches; if it is incorrect, the processor restores the execution state from the corresponding checkpoint. Examples include the PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, and AMD K5/K6/Athlon.&lt;br /&gt;
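The checkpoint-and-recover cycle described above can be sketched in Python. This is a toy model: real processors checkpoint rename maps and other microarchitectural state in hardware, not a register dictionary, and the class and method names here are invented for illustration.

```python
import copy

class SpeculativeCore:
    """Toy model of branch speculation with checkpoint/restore."""

    def __init__(self):
        self.regs = {}          # stand-in for the register file
        self.checkpoints = []   # stack of saved states, one per branch

    def predict_branch(self, predicted_taken=True):
        # Take a checkpoint before following the predicted path.
        self.checkpoints.append(copy.deepcopy(self.regs))
        return predicted_taken

    def resolve_branch(self, predicted_taken, actually_taken):
        saved = self.checkpoints.pop()
        if predicted_taken != actually_taken:
            # Misprediction: restore state from the checkpoint,
            # discarding all speculative writes.
            self.regs = saved
            return "squash"
        # Correct prediction: the checkpoint is simply reclaimed.
        return "commit"
```

After a mispredicted branch, any register writes made down the wrong path disappear when `resolve_branch` restores the saved state; on a correct prediction the checkpoint is discarded and the speculative work becomes permanent.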
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Updated Figure 1.8 &amp;amp; 1.9 ==&lt;br /&gt;
[[Image:fig18.jpg|frame|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]&lt;br /&gt;
&lt;br style=&amp;quot;clear: both&amp;quot; /&gt;&lt;br /&gt;
Figure 1.8 of our book has been updated to incorporate trends from 2000 to the present. The SGI Origin 3000 series was reintroduced as the Origin 3400 and the Origin 3900 in 2000 and 2003, respectively. Sun introduced enterprise servers even more powerful than the E10000: the E15000 in 2002, and the E20000 and E25000 in 2006. HP's high-end HP 9000 Superdome supercomputers with 16, 32, and 64 processors were released this year (2007).&lt;br /&gt;
[[Image:fig19.jpg|frame|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors(Y-axis is log-scaled)]]&lt;br /&gt;
&lt;br style=&amp;quot;clear: both&amp;quot; /&gt;&lt;br /&gt;
Figure 1.9 shows the bandwidth of the shared memory bus of the servers introduced in Figure 1.8: the SGI Origin 3000 series and the Sun Enterprise 15K, 20K, and 25K, as well as the IBM p5 590 and the HP 9000 Superdome. In the case of the Sun E25K, the available bandwidth is 43.2 GBps and the aggregate bandwidth exceeds 100 GBps. The Origin 3900 has a bandwidth of 12.8 GBps and an aggregate bandwidth of 172.8 GBps.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] John L. Hennessy, David A. Patterson, &amp;quot;Computer Architecture: A Quantitative Approach&amp;quot; 3rd Ed., Morgan Kaufmann, CA, USA&lt;br /&gt;
&lt;br /&gt;
[2] CE Kozyrakis, DA Patterson, &amp;quot;A new direction for computer architecture research&amp;quot;, &lt;br /&gt;
Computer Volume 31 Issue 11, IEEE, Nov 1998, pp24-32&lt;br /&gt;
&lt;br /&gt;
[3] K.C. Yeager, &amp;quot;The MIPS R10000 Superscalar Microprocessor&amp;quot;, IEEE Micro Volume 16 Issue 2, Apr. 1996, pp28-41&lt;br /&gt;
&lt;br /&gt;
[4] Geoff Koch, &amp;quot;Discovering Multi-Core: Extending the Benefits of Moore’s Law&amp;quot;, Technology@Intel Magazine, Jul 2005, pp1-6&lt;br /&gt;
&lt;br /&gt;
[5] Richard Low, &amp;quot;Microprocessor trends: multicore, memory, and power developments&amp;quot;, Embedded Computing Design, Sep 2005&lt;br /&gt;
&lt;br /&gt;
[6] Artur Klauser, &amp;quot;Trends in High-Performance Microprocessor Design&amp;quot;, Telematik 1, 2001&lt;br /&gt;
&lt;br /&gt;
[7] http://www.intel.com &amp;amp; http://www.intel.com/pressroom/kits/pentiumee&lt;br /&gt;
&lt;br /&gt;
[8] http://www.alimartech.com/9000_servers.htm&lt;br /&gt;
&lt;br /&gt;
[9] http://www.sun.com/servers/index.jsp?gr0=cpu&amp;amp;fl0=cpu4&amp;amp;gr1=&lt;br /&gt;
&lt;br /&gt;
[10] http://www.sgi.com/pdfs/3867.pdf&lt;br /&gt;
&lt;br /&gt;
[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html&lt;/div&gt;</summary>
		<author><name>Sykang</name></author>
	</entry>
</feed>