CSC 456 Fall 2013/1c wa: Difference between revisions

Revision as of 22:02, 3 October 2013

Trends in cache size and organization

Introduction

Cache size has grown over the years alongside the evolution of the microprocessor. Intuitively, one would expect cache sizes to keep growing, following some law similar to Moore’s Law. In practice, however, L1 cache sizes have all but maxed out for an individual processor. The trend shows that some processor lines stopped growing from one generation to the next and in some cases even decreased in size. Cache associativity has also varied over the years. No cache organization is optimal for every situation, but certain organizations perform better for most tasks on certain systems. This wiki analyzes data on cache sizes and associativities to gain some insight into the trends and reasoning behind vendor choices of cache size and organization. It focuses on the period from the late 1980s and early 1990s to the early 2000s.

Cache Associativity

This table shows the cache associativities of some mainstream processors from the late 1980s to the early 2000s, with one processor from 1968 for reference. In the late 1980s and early 1990s, designs tended towards set-associative caches with around four lines per set (four ways). In the mid-1990s they tended towards lower associativity and direct mapping. In the late 1990s and early 2000s they moved back towards higher associativity with larger sets.

L1, L2, L3 Associativity

System	Year	L1 Associativity	L2 Associativity	L3 Associativity	Notes
IBM 360/85	1968	Sector	N/A	N/A	First processor with a cache; clock speed 12.5 MHz
Intel 80486	1989	4-way	N/A	N/A
SuperSPARC	1992	4-way & 5-way	N/A	N/A	Used to render Toy Story; core @ 40 MHz
Alpha 21064 (DEC)	1992	Direct	Direct	N/A
UltraSPARC	1995	2-way & direct	Direct	N/A	64-bit; core @ 200 MHz
Alpha 21164 (DEC)	1995	Direct	3-way	N/A
Pentium Pro	1995	2-way & 4-way	?	N/A	L2 in the processor package
K6-III	1999	2-way	4-way	N/A
Pentium 4	10/2000	4-way	8-way	N/A
UltraSPARC III	2001	4-way	N/A	N/A
Itanium 2	2002	4-way	8-way	12-way
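
The difference between direct mapping and higher associativity in the table above can be sketched in a few lines of Python. The cache geometry used here (8 KB, 32-byte lines) and the function name split_address are assumptions chosen for illustration, not the parameters of any processor listed.

```python
def split_address(addr, cache_bytes, line_bytes, ways):
    """Split a byte address into (tag, set_index, offset).

    ways=1 models a direct-mapped cache; ways equal to the number of
    lines models a fully associative cache (a single set).
    """
    num_lines = cache_bytes // line_bytes
    num_sets = num_lines // ways
    offset = addr % line_bytes        # byte within the cache line
    block = addr // line_bytes        # memory block number
    set_index = block % num_sets      # which set the block maps to
    tag = block // num_sets           # identifies the block within its set
    return tag, set_index, offset


# Hypothetical geometry: 8 KB cache with 32-byte lines (assumed values).
CACHE_BYTES = 8 * 1024
LINE_BYTES = 32

# Two addresses exactly one cache size apart.
a = 0x1000
b = 0x1000 + CACHE_BYTES

direct = [split_address(x, CACHE_BYTES, LINE_BYTES, ways=1) for x in (a, b)]
four_way = [split_address(x, CACHE_BYTES, LINE_BYTES, ways=4) for x in (a, b)]

print("direct-mapped:", direct)
print("4-way:        ", four_way)
```

In both organizations the two addresses map to the same set with different tags. With one way per set each access evicts the other, while with four ways both lines can stay resident, at the cost of comparing four tags on every lookup.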

Cache Size

Looking at this table, we can see cache sizes generally growing across all manufacturers. However, within a single manufacturer's line the cache size sometimes decreases from one generation to the next. For instance, in 1992 the SuperSPARC had a 16+20 KB L1 cache, while in 1995 the UltraSPARC had only a 16+16 KB L1 cache, although the L2 cache capacity increased. More recently, the table lists the L1 of Intel's Core line as dropping from 64 KB per core in the Nehalem architecture (2008) to 32 KB per core in Sandy Bridge (2011). This comparison should be read with care: Nehalem's 64 KB is the sum of a 32 KB instruction cache and a 32 KB data cache, so the apparent drop may come from how a split L1 is counted. In this instance the L2 cache stayed the same at 256 KB per core, while the maximum L3 capacity increased.

L1, L2, L3 Size by Year

Processor	System Type	Year	L1 size	L2 size	L3 size
IBM 360/85	Mainframe	1968	16 to 32 KB	—	—
PDP-11/70	Minicomputer	1975	1 KB	—	—
VAX 11/780	Minicomputer	1978	16 KB	—	—
IBM 3033	Mainframe	1978	64 KB	—	—
IBM 3090	Mainframe	1985	128 to 256 KB	—	—
Intel 80486	PC	1989	8 KB	—	—
SuperSPARC	PC	1992	16 KB/20 KB	0 to 2 MB	—
Pentium	PC	1993	8 KB/8 KB	256 to 512 KB	—
PowerPC 601	PC	1993	32 KB	—	—
UltraSPARC	PC	1995	16 KB/16 KB	512 KB to 4 MB	—
Pentium Pro	PC	1995	8 KB/8 KB	256 KB to 1 MB	—
PowerPC 620	PC	1996	32 KB/32 KB	—	—
PowerPC G4	PC/server	1999	32 KB/32 KB	256 KB to 1 MB	2 MB
IBM S/390 G4	Mainframe	1997	32 KB	256 KB	2 MB
IBM S/390 G6	Mainframe	1999	256 KB	8 MB	—
Pentium 4	PC/server	2000	8 KB/8 KB	256 KB	—
IBM SP	High-end server	2000	64 KB/32 KB	8 MB	—
CRAY MTA	Supercomputer	2000	8 KB	2 MB	—
UltraSPARC III	PC	2001	32 KB/64 KB	2 to 8 MB	—
Itanium	PC/server	2001	16 KB/16 KB	96 KB	4 MB
SGI Origin 2001	High-end server	2001	32 KB/32 KB	4 MB	—
Itanium 2	PC/server	2002	32 KB	256 KB	6 MB
IBM POWER5	High-end server	2003	64 KB	1.9 MB	36 MB
CRAY XD-1	Supercomputer	2004	64 KB/64 KB	1 MB	—
Nehalem (Core i5/i7, Xeon)	PC, server	2008	64 KB per core	256 KB per core	4 MB to 12 MB total
Sandy Bridge (Core i3/i5/i7, Pentium)	PC	2011	32 KB per core	256 KB per core	1 MB to 20 MB total

Main Memory Specs

Finally, main memory latency needs to be analyzed to see how it affects the cache. The cache is necessary in the first place because of the severe disparity between processor speed and the speed of main memory, which is usually implemented with SDRAM. The cache provides a buffer between the registers and main memory to reduce the time the processor spends waiting on data from main memory. There are two main restrictions on this, however. Firstly, cache is expensive. Secondly, when cache size is increased, so is the access time[10]. To maximize cache usefulness, the L1 needs to be as fast as the processor, or at least fast enough to deliver data to the pipeline between an instruction being decoded and executed.

As was noted many years ago, processor speed has grown much faster than DRAM speed[8]. The gap is speculated to grow large enough that a "Memory Wall" will be reached if a solution is not found[8]: once the divergence is large enough, a system's speed will be determined solely by its memory speed.

Below are a few examples of main memory speeds and the introduction years of these standards. As can be seen from the tables below, CAS latency (CL) times have improved slightly over the years, along with the data bus speed. (CAS latency is the time to access a word in a given column of a row that is already open. Main memory can be viewed as a 2D array in which the row is accessed first and then the column to fetch a word.) DDR3 bus speed is actually close to the clock speed of today's processors. Latency can still be affected by row lookups, however: if a row is not already open it must be opened, and this is usually the most expensive step in terms of time. Main memory sizes have also only gotten larger over the years, further increasing row and column access times.

DRAM cannot be the sole culprit for the slowdown in processor speed growth, however. As the evolution of standard processor design has shown, adding more levels of increasingly larger cache can help offset the effects of memory latency. Certain techniques can also be employed to combat the memory wall, such as out-of-order (OOO) execution and speculative precomputation (SP) [11]. Physical cooling limits of current technology also limit processor speeds. All of these hardware issues, however, can be explained as a lack of progress due to a lack of expenditure. Since the majority of funding for computers today derives from home-grade consumers, a technology will not attract investment unless it can be shown to have a decent chance of making back that investment in the consumer market, and consumers are only willing to pay so much, especially when much cheaper options that adequately meet their needs are available.
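
The claim that extra cache levels offset memory latency can be illustrated with the standard average memory access time (AMAT) recurrence: each level costs its hit time plus its miss rate times the cost of the level below it. This is a minimal sketch; the function name amat and every number in it are illustrative assumptions, not measurements of any processor discussed here.

```python
def amat(levels, memory_latency):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_time, local_miss_rate) tuples, L1 first.
    memory_latency: cost of an access that goes all the way to DRAM.
    All times share one unit (for example, processor cycles).
    """
    total = memory_latency
    # Work from the level closest to memory back up to L1.
    for hit_time, miss_rate in reversed(levels):
        total = hit_time + miss_rate * total
    return total


# Illustrative values only (assumed, not taken from the tables).
MEMORY = 200                      # cycles to reach main memory
L1 = (2, 0.05)                    # 2-cycle hit, 5% miss rate
L2 = (12, 0.25)                   # 12-cycle hit, 25% local miss rate

one_level = amat([L1], MEMORY)
two_level = amat([L1, L2], MEMORY)

print(f"L1 only: {one_level:.1f} cycles")
print(f"L1 + L2: {two_level:.1f} cycles")
```

Under these assumed numbers the second level cuts the average access time by more than half, even though the L1 itself is unchanged.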

SDRAM: <1998
DDR: 2000
DDR2: 2003
DDR3: 2007

Memory timing examples (CAS latency only)
Generation	Type	Data rate	Bit time	Command rate	Cycle time	CL	First word	Fourth word	Eighth word
SDRAM	PC100	100 MT/s	10 ns	100 MHz	10 ns	2	20 ns	50 ns	90 ns
SDRAM	PC133	133 MT/s	7.5 ns	133 MHz	7.5 ns	3	22.5 ns	45 ns	75 ns
DDR SDRAM	DDR-333	333 MT/s	3 ns	166 MHz	6 ns	2.5	15 ns	24 ns	36 ns
	DDR-400	400 MT/s	2.5 ns	200 MHz	5 ns	3	15 ns	22.5 ns	32.5 ns
						2.5	12.5 ns	20 ns	30 ns
						2	10 ns	17.5 ns	27.5 ns
DDR2 SDRAM	DDR2-667	667 MT/s	1.5 ns	333 MHz	3 ns	5	15 ns	19.5 ns	25.5 ns
	DDR2-667	667 MT/s	1.5 ns	333 MHz	3 ns	4	12 ns	16.5 ns	22.5 ns
	DDR2-800	800 MT/s	1.25 ns	400 MHz	2.5 ns	6	15 ns	18.75 ns	23.75 ns
						5	12.5 ns	16.25 ns	21.25 ns
						4.5	11.25 ns	15 ns	20 ns
						4	10 ns	13.75 ns	18.75 ns
	DDR2-1066	1066 MT/s	0.9375 ns	533 MHz	1.875 ns	7	13.13 ns	15.94 ns	19.69 ns
						6	11.25 ns	14.06 ns	17.81 ns
						5	9.38 ns	12.19 ns	15.94 ns
						4.5	8.44 ns	11.25 ns	15 ns
						4	7.5 ns	10.31 ns	14.06 ns
DDR3 SDRAM	DDR3-1066	1066 MT/s	0.9375 ns	533 MHz	1.875 ns	7	13.13 ns	15.94 ns	19.69 ns
	DDR3-1333	1333 MT/s	0.75 ns	666 MHz	1.5 ns	9	13.5 ns	15.75 ns	18.75 ns
	DDR3-1333	1333 MT/s	0.75 ns	666 MHz	1.5 ns	6	9 ns	11.25 ns	14.25 ns
	DDR3-1375	1375 MT/s	0.73 ns	687 MHz	1.45 ns	5	7.27 ns	9.45 ns	12.36 ns
	DDR3-1600	1600 MT/s	0.625 ns	800 MHz	1.25 ns	9	11.25 ns	13.125 ns	15.625 ns
						8	10 ns	11.875 ns	14.375 ns
						7	8.75 ns	10.625 ns	13.125 ns
						6	7.50 ns	9.375 ns	11.875 ns
	DDR3-2000	2000 MT/s	0.5 ns	1000 MHz	1 ns	10	10 ns	11.5 ns	13.5 ns
						9	9 ns	10.5 ns	12.5 ns
						8	8 ns	9.5 ns	11.5 ns
						7	7 ns	8.5 ns	10.5 ns
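
The first, fourth and eighth word columns in the timing table follow from the data rate and CL alone. A minimal sketch, assuming the first word arrives after CL command-clock cycles and each later word arrives one bit time after the previous one; the function name word_latency_ns and the ddr flag are illustrative, not part of any standard.

```python
def word_latency_ns(data_rate_mts, cas_latency, word=1, ddr=True):
    """Time in ns until the given word of a burst arrives (CAS latency only).

    data_rate_mts: transfers per second in MT/s (e.g. 1600 for DDR3-1600).
    ddr: True for DDR/DDR2/DDR3 (two transfers per clock),
         False for single-data-rate SDRAM such as PC100.
    """
    bit_time = 1000.0 / data_rate_mts            # ns per transfer
    cycle_time = bit_time * (2 if ddr else 1)    # ns per command clock
    return cas_latency * cycle_time + (word - 1) * bit_time


# Reproduce three rows of the table.
pc100_cl2 = [word_latency_ns(100, 2, w, ddr=False) for w in (1, 4, 8)]
ddr400_cl3 = [word_latency_ns(400, 3, w) for w in (1, 4, 8)]
ddr3_1600_cl9 = [word_latency_ns(1600, 9, w) for w in (1, 4, 8)]

print("PC100 CL2:    ", pc100_cl2)
print("DDR-400 CL3:  ", ddr400_cl3)
print("DDR3-1600 CL9:", ddr3_1600_cl9)
```

The sketch makes the trend in the table visible: first-word latency shrinks slowly across generations because CL rises as cycle time falls, while the later words of a burst arrive much sooner because bit time falls steadily.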

DRAM Memory Standards:

Standard	Mem Clock	Cycle time	I/O Bus Clock	Module Name	Peak Transfer Rate	Prefetch	Latency	Year
DDR-333	166 MHz	6 ns
DDR2-400	100 MHz	10 ns	200 MHz	PC2-3200	3200 MB/s	4n	4-6 bus clock cycles	2003
DDR2-533	133 MHz	7.5 ns	266 MHz	PC2-4200	4266 MB/s	4n
DDR2-667	166 MHz	6 ns	333 MHz	PC2-5300	5333 MB/s	4n
DDR2-800	200 MHz	5 ns	400 MHz	PC2-6400	6400 MB/s	4n
DDR2-1066	266 MHz	3.75 ns	533 MHz	PC2-8500	8533 MB/s	4n
DDR3-800	100 MHz	10 ns	400 MHz	PC3-6400	6400 MB/s	8n	5-9 ns (7 avg.)	2007
DDR3-1066	133 MHz	7.5 ns	533 MHz	PC3-8500	8533 MB/s	8n
DDR3-1333	166 MHz	6 ns	667 MHz	PC3-10600	10667 MB/s	8n
DDR3-1600	200 MHz	5 ns	800 MHz	PC3-12800	12800 MB/s	8n
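
The I/O bus clock and peak transfer rate columns can be derived from the memory clock and the prefetch depth. This is a rough sketch, assuming a 64-bit (8-byte) module data bus and two transfers per I/O clock; the function names are made up for illustration.

```python
def io_bus_clock_mhz(mem_clock_mhz, prefetch):
    """I/O bus clock needed to drain an n-bit prefetch at double data rate."""
    # Two transfers per I/O clock, so the bus runs at prefetch/2 times
    # the memory array clock.
    return mem_clock_mhz * prefetch / 2


def peak_transfer_mb_s(io_clock_mhz, bus_bytes=8):
    """Peak transfer rate in MB/s; bus_bytes=8 assumes a 64-bit module."""
    return io_clock_mhz * 2 * bus_bytes


# DDR2-400 and DDR3-800 share a 100 MHz memory clock.
ddr2_400_io = io_bus_clock_mhz(100, prefetch=4)
ddr3_800_io = io_bus_clock_mhz(100, prefetch=8)

ddr2_400_peak = peak_transfer_mb_s(ddr2_400_io)
ddr3_800_peak = peak_transfer_mb_s(ddr3_800_io)

print(f"DDR2-400: {ddr2_400_io:.0f} MHz bus, {ddr2_400_peak:.0f} MB/s")
print(f"DDR3-800: {ddr3_800_io:.0f} MHz bus, {ddr3_800_peak:.0f} MB/s")
```

With the same 100 MHz memory array clock, doubling the prefetch from 4n to 8n doubles the peak transfer rate, which is why bandwidth has grown much faster than the underlying array access time.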

Conclusion

Theory: Cache associativity decreased as cache size became larger because, once the cache was too large, it became too expensive to search the cache on every access. Also, the bigger the cache is as a percentage of main memory, the less need there is for associativity. But while caches and main memory have both grown, main memory has grown faster in the 2000s. So as the ratio of cache size to main memory size goes down, associativity needs to increase.

From Argwal et al., p. 6: increasing cache size increases access time, although technology scale-downs have improved this.

The Pentium Pro (1995) moved the L2 cache into the processor package. Before this, the L2 cache was an optional add-on to the motherboard. [1]

Systems to consider in table

Pentium
AMD
MIPS
Sun Microsystems: SPARC
IBM: PowerPC
DEC: Alpha

Miss penalty to reach main memory: below 100 before 2000; after 2000 it started to increase.
< 20: 1 level is fine
<= 100: 2 levels
>= 200: 3 levels

Miss rate reported, SPEC benchmarks

References

