CSC 456 Fall 2013/1c wa
Latest revision as of 22:07, 6 October 2013
Trends in cache size and organization
Introduction
Cache sizes have grown alongside the evolution of the microprocessor. Intuitively, one would expect them to keep growing according to some law similar to Moore's Law. In practice, however, L1 cache sizes have all but stopped growing for an individual processor. The data show that some processor lines kept the same cache size from one generation to the next, and in some cases the size even decreased. Cache associativity has also varied over the years. No cache organization is optimal for every situation, but some organizations perform better for most tasks on a given system. This wiki analyzes data on cache sizes and associativities to gain some insight into the trends in, and the reasoning behind, vendors' choices of cache size and organization. It focuses on the period from the late 1980s to the early 2000s.
Cache Associativity
The table below lists the cache associativities of some mainstream processors from the late 1980s to the early 2000s, with one processor from 1968 included for reference. In the late 1980s and early 1990s, designs tended towards set-associative caches with around four ways (lines per set). In the mid-1990s they tended towards lower associativity and direct mapping. In the late 1990s and early 2000s they moved back towards higher associativity, with more lines per set.
L1, L2, L3 Associativity
System | Year | L1 Associativity | L2 Associativity | L3 Associativity | Notes |
IBM 360/85 | 1968 | Sector | N/A | N/A | First processor with a cache; clock speed 12.5 MHz |
Intel 80486 | 1989 | 4-way | N/A | N/A | |
SuperSPARC | 1992 | 4-way & 5-way | N/A | N/A | Used to render Toy Story; core @ 40 MHz |
Alpha 21064 (DEC) | 1992 | Direct | Direct | N/A | |
UltraSPARC | 1995 | 2-way & direct | Direct | N/A | 64-bit; core @ 200 MHz |
Alpha 21164 (DEC) | 1995 | Direct | 3-way | N/A | |
Pentium Pro | 1995 | 2-way & 4-way | ? | N/A | First with L2 in the processor package (separate die) |
K6-III | 1999 | 2-way | 4-way | N/A | |
Pentium 4 | 10/2000 | 4-way | 8-way | N/A | |
UltraSPARC III | 2001 | 4-way | N/A | N/A | |
Itanium 2 | 2002 | 4-way | 8-way | 12-way | |
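To make the associativity figures concrete, the sketch below shows how a cache's size, associativity, and line size determine the number of sets, and how an address splits into tag, set index, and byte offset. It is a minimal illustration rather than a model of any processor in the table: the function names (`cache_geometry`, `split_address`) and the 16-byte line size in the example are assumptions made here, since the table does not list line sizes.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return set count and address bit widths for a set-associative cache.

    A direct-mapped cache is the special case ways == 1.
    Assumes the line size and the resulting set count are powers of two.
    """
    num_sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder or num_sets == 0:
        raise ValueError("size must be a positive multiple of ways * line_bytes")
    for value in (line_bytes, num_sets):
        if value & (value - 1):
            raise ValueError("line size and set count must be powers of two")
    return {
        "num_sets": num_sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line_bytes)
        "index_bits": num_sets.bit_length() - 1,     # log2(num_sets)
    }


def split_address(addr, geometry):
    """Split an address into (tag, set index, byte offset)."""
    offset_bits = geometry["offset_bits"]
    index_bits = geometry["index_bits"]
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Hypothetical 8 KB cache with 16-byte lines, organized two different ways.
four_way = cache_geometry(8 * 1024, ways=4, line_bytes=16)
direct = cache_geometry(8 * 1024, ways=1, line_bytes=16)
print(four_way["num_sets"], direct["num_sets"])
print(split_address(0x12345, four_way))
```

For the same capacity, the 4-way organization has a quarter as many sets as the direct-mapped one, so each address has four candidate lines instead of one.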
Cache Size
In accordance with Moore's law, as the number of transistors on a chip increases we would expect cache sizes to increase with each generation of processors. Main memory sizes have kept increasing, so we would expect a similar trend in caches. The table below does show L1 cache sizes increasing up to the 2000s, but there are irregularities in the 1990s: at certain stages cache growth stalled, and for an individual vendor it sometimes even reversed from one generation to the next. The Pentium and the Pentium Pro, for instance, both had 16 KB of L1 cache (8 KB instruction + 8 KB data). The Pro, however, was the first to place the L2 cache inside the processor package (on a separate die). Similarly, the SuperSPARC came out in 1992 with 36 KB of L1, while the UltraSPARC of 1995 had only 32 KB; in this instance the L2 capacity increased. So although an individual cache may stay the same size or even shrink, this is usually accompanied by another change. As the table shows, the typical L1 size per core leveled out at about 64 KB (32 KB instruction + 32 KB data) around 1999.
L1, L2, L3 Size by Year
Processor | System Type | Year | L1 size | L2 size | L3 size |
IBM 360/85 | Mainframe | 1968 | 16 to 32 KB | — | — |
PDP-11/70 | Minicomputer | 1975 | 1 KB | — | — |
VAX 11/780 | Minicomputer | 1978 | 16 KB | — | — |
IBM 3033 | Mainframe | 1978 | 64 KB | — | — |
IBM 3090 | Mainframe | 1985 | 128 to 256 KB | — | — |
Intel 80486 | PC | 1989 | 8 KB | — | — |
SuperSPARC | PC | 1992 | 16 KB/20 KB | 0 to 2 MB | — |
Pentium | PC | 1993 | 8 KB/8 KB | 256 to 512 KB | — |
PowerPC 601 | PC | 1993 | 32 KB | — | — |
UltraSPARC | PC | 1995 | 16 KB/16 KB | 512 KB to 4 MB | — |
Pentium Pro | PC | 1995 | 8 KB/8 KB | 256 KB - 1 MB | — |
PowerPC 620 | PC | 1996 | 32 KB/32 KB | — | — |
PowerPC G4 | PC/server | 1999 | 32 KB/32 KB | 256 KB to 1 MB | 2 MB |
IBM S/390 G4 | Mainframe | 1997 | 32 KB | 256 KB | 2 MB |
IBM S/390 G6 | Mainframe | 1999 | 256 KB | 8 MB | — |
Pentium 4 | PC/server | 2000 | 8 KB/8 KB | 256 KB | — |
IBM SP | High-end server | 2000 | 64 KB/32 KB | 8 MB | — |
CRAY MTA | Supercomputer | 2000 | 8 KB | 2 MB | — |
UltraSPARC III | PC | 2001 | 32 KB/64 KB | 2 to 8 MB | — |
Itanium | PC/server | 2001 | 16 KB/16 KB | 96 KB | 4 MB |
SGI Origin 2001 | High-end server | 2001 | 32 KB/32 KB | 4 MB | — |
Itanium 2 | PC/server | 2002 | 32 KB | 256 KB | 6 MB |
IBM POWER5 | High-end server | 2003 | 64 KB | 1.9 MB | 36 MB |
CRAY XD-1 | Supercomputer | 2004 | 64 KB/64 KB | 1 MB | — |
Nehalem (Core i5/i7, Xeon) | PC, Server | 2008 | 32 KB/32 KB per core | 256 KB per core | 4 MB to 12 MB total |
Sandy Bridge (Core i3/i5/i7, Pentium) | PC, Server | 2011 | 32 KB/32 KB per core | 256 KB per core | 1 MB to 20 MB total |
Main Memory Issues
Finally, main memory latency needs to be analyzed to see how it affects the cache. The cache is necessary in the first place because of the severe disparity between processor speed and the speed of main memory, which is usually implemented with SDRAM. The introduction years of the main memory standards, and examples of their speeds, are given below. The cache acts as a buffer between the registers and main memory, reducing the time the processor spends waiting on data from main memory. There are two main restrictions on this, however. First, cache is expensive. Second, increasing the size of a cache also increases its access time[10]. To get the most out of the cache, the L1 must be as fast as the processor, or at least fast enough to deliver data to the pipeline between the decoding and the execution of an instruction.
As was noted many years ago, processor speed grows much faster than DRAM speed[8]. The gap is speculated to grow so large that a "Memory Wall" will be reached if no solution is found[8]: once the divergence is large enough, a system's speed will be determined solely by its memory speed. As the table below shows, CAS latency (CL) times have improved slightly over the years, along with data bus speed. (CAS latency is the time to access a word in a given column of a row that is already open. Main memory can be viewed as a 2D array in which a word is fetched by accessing first the row and then the column.) DDR3 bus speeds are actually fairly close to the clock speeds of today's processors. Latency is still affected by row lookups, however: if a row is not already open it must be opened first, and this is usually the most expensive step in terms of time. Even so, DRAM cannot be the sole culprit for the slowdown in processor speed growth.
As the evolution of standard processor design has shown, adding more levels of progressively larger cache can help offset the effects of growing memory latency. Techniques such as out-of-order (OOO) execution and speculative precomputation (SP) can also be used to combat the memory wall[11]. Processor speeds are further limited by the physical cooling limits of current technology. All of these hardware issues, however, can be explained as a lack of progress caused by a lack of expenditure. Because most of the funding for computers today comes from home consumers, a technology will not attract investment unless it can be shown to have a strong chance of recovering that investment. Currently the capital required keeps rising while the improvement from each generation keeps shrinking, so making the next generation fast enough may make the processors too expensive for the mass market. The shift of consumer computing towards mobile devices also makes speedups less important than mobility, further sidelining the memory wall.
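The benefit of adding cache levels can be illustrated with the standard average memory access time (AMAT) calculation, in which each level contributes its hit time plus its miss rate times the cost of going one level further out. The sketch below is only an illustration: the function name `amat` and every latency and miss rate in the example are hypothetical values chosen here, not measurements of any processor discussed in this article.

```python
def amat(levels, memory_latency):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_time, local_miss_rate) tuples, ordered from L1 outward.
    memory_latency: time to fetch from main memory, in the same unit as hit_time.
    """
    penalty = memory_latency
    # Work from the outermost cache level back towards L1.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty


# Hypothetical numbers, in processor cycles.
l1_only = amat([(1, 0.05)], memory_latency=200)
l1_and_l2 = amat([(1, 0.05), (10, 0.20)], memory_latency=200)
print(l1_only, l1_and_l2)
```

With these assumed numbers, the L2 turns most L1 misses into 10-cycle accesses instead of 200-cycle ones, which is why a growing memory latency can be partly hidden by a deeper hierarchy.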
SDRAM: <1998
DDR: 2000
DDR2: 2003
DDR3: 2007
Memory timing examples (CAS latency only)
Generation | Type | Data rate | Bit time | Command rate | Cycle time | CL | First word | Fourth word | Eighth word |
---|---|---|---|---|---|---|---|---|---|
SDRAM | PC100 | 100 MT/s | 10 ns | 100 MHz | 10 ns | 2 | 20 ns | 50 ns | 90 ns |
SDRAM | PC133 | 133 MT/s | 7.5 ns | 133 MHz | 7.5 ns | 3 | 22.5 ns | 45 ns | 75 ns |
DDR SDRAM | DDR-333 | 333 MT/s | 3 ns | 166 MHz | 6 ns | 2.5 | 15 ns | 24 ns | 36 ns |
DDR SDRAM | DDR-400 | 400 MT/s | 2.5 ns | 200 MHz | 5 ns | 3 | 15 ns | 22.5 ns | 32.5 ns |
DDR SDRAM | DDR-400 | 400 MT/s | 2.5 ns | 200 MHz | 5 ns | 2.5 | 12.5 ns | 20 ns | 30 ns |
DDR SDRAM | DDR-400 | 400 MT/s | 2.5 ns | 200 MHz | 5 ns | 2 | 10 ns | 17.5 ns | 27.5 ns |
DDR2 SDRAM | DDR2-667 | 667 MT/s | 1.5 ns | 333 MHz | 3 ns | 5 | 15 ns | 19.5 ns | 25.5 ns |
DDR2 SDRAM | DDR2-667 | 667 MT/s | 1.5 ns | 333 MHz | 3 ns | 4 | 12 ns | 16.5 ns | 22.5 ns |
DDR2 SDRAM | DDR2-800 | 800 MT/s | 1.25 ns | 400 MHz | 2.5 ns | 6 | 15 ns | 18.75 ns | 23.75 ns |
DDR2 SDRAM | DDR2-800 | 800 MT/s | 1.25 ns | 400 MHz | 2.5 ns | 5 | 12.5 ns | 16.25 ns | 21.25 ns |
DDR2 SDRAM | DDR2-800 | 800 MT/s | 1.25 ns | 400 MHz | 2.5 ns | 4.5 | 11.25 ns | 15 ns | 20 ns |
DDR2 SDRAM | DDR2-800 | 800 MT/s | 1.25 ns | 400 MHz | 2.5 ns | 4 | 10 ns | 13.75 ns | 18.75 ns |
DDR2 SDRAM | DDR2-1066 | 1066 MT/s | 0.9375 ns | 533 MHz | 1.875 ns | 7 | 13.13 ns | 15.94 ns | 19.69 ns |
DDR2 SDRAM | DDR2-1066 | 1066 MT/s | 0.9375 ns | 533 MHz | 1.875 ns | 6 | 11.25 ns | 14.06 ns | 17.81 ns |
DDR2 SDRAM | DDR2-1066 | 1066 MT/s | 0.9375 ns | 533 MHz | 1.875 ns | 5 | 9.38 ns | 12.19 ns | 15.94 ns |
DDR2 SDRAM | DDR2-1066 | 1066 MT/s | 0.9375 ns | 533 MHz | 1.875 ns | 4.5 | 8.44 ns | 11.25 ns | 15 ns |
DDR2 SDRAM | DDR2-1066 | 1066 MT/s | 0.9375 ns | 533 MHz | 1.875 ns | 4 | 7.5 ns | 10.31 ns | 14.06 ns |
DDR3 SDRAM | DDR3-1066 | 1066 MT/s | 0.9375 ns | 533 MHz | 1.875 ns | 7 | 13.13 ns | 15.94 ns | 19.69 ns |
DDR3 SDRAM | DDR3-1333 | 1333 MT/s | 0.75 ns | 666 MHz | 1.5 ns | 9 | 13.5 ns | 15.75 ns | 18.75 ns |
DDR3 SDRAM | DDR3-1333 | 1333 MT/s | 0.75 ns | 666 MHz | 1.5 ns | 6 | 9 ns | 11.25 ns | 14.25 ns |
DDR3 SDRAM | DDR3-1375 | 1375 MT/s | 0.73 ns | 687 MHz | 1.45 ns | 5 | 7.27 ns | 9.45 ns | 12.36 ns |
DDR3 SDRAM | DDR3-1600 | 1600 MT/s | 0.625 ns | 800 MHz | 1.25 ns | 9 | 11.25 ns | 13.125 ns | 15.625 ns |
DDR3 SDRAM | DDR3-1600 | 1600 MT/s | 0.625 ns | 800 MHz | 1.25 ns | 8 | 10 ns | 11.875 ns | 14.375 ns |
DDR3 SDRAM | DDR3-1600 | 1600 MT/s | 0.625 ns | 800 MHz | 1.25 ns | 7 | 8.75 ns | 10.625 ns | 13.125 ns |
DDR3 SDRAM | DDR3-1600 | 1600 MT/s | 0.625 ns | 800 MHz | 1.25 ns | 6 | 7.5 ns | 9.375 ns | 11.875 ns |
DDR3 SDRAM | DDR3-2000 | 2000 MT/s | 0.5 ns | 1000 MHz | 1 ns | 10 | 10 ns | 11.5 ns | 13.5 ns |
DDR3 SDRAM | DDR3-2000 | 2000 MT/s | 0.5 ns | 1000 MHz | 1 ns | 9 | 9 ns | 10.5 ns | 12.5 ns |
DDR3 SDRAM | DDR3-2000 | 2000 MT/s | 0.5 ns | 1000 MHz | 1 ns | 8 | 8 ns | 9.5 ns | 11.5 ns |
DDR3 SDRAM | DDR3-2000 | 2000 MT/s | 0.5 ns | 1000 MHz | 1 ns | 7 | 7 ns | 8.5 ns | 10.5 ns |
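The word-latency columns of the memory timing table follow from the data rate and the CAS latency: the first word arrives after CL command-clock cycles, and each later word arrives one bit time after the previous one. The sketch below reproduces that arithmetic. The function name `word_latencies_ns` and the `transfers_per_clock` parameter are names introduced here for illustration (1 for single-data-rate SDRAM, 2 for the DDR generations), and row-activation time is ignored, as it is in the table.

```python
def word_latencies_ns(data_rate_mts, cas_latency, transfers_per_clock=2):
    """Return (first, fourth, eighth) word latencies in nanoseconds.

    data_rate_mts: data rate in megatransfers per second (MT/s).
    cas_latency: CL, in command-clock cycles.
    transfers_per_clock: 1 for SDR SDRAM, 2 for DDR/DDR2/DDR3.
    """
    bit_time = 1000.0 / data_rate_mts             # ns per transfer
    cycle_time = bit_time * transfers_per_clock   # ns per command-clock cycle
    first = cas_latency * cycle_time
    fourth = first + 3 * bit_time
    eighth = first + 7 * bit_time
    return first, fourth, eighth


pc100_cl2 = word_latencies_ns(100, 2, transfers_per_clock=1)
ddr400_cl3 = word_latencies_ns(400, 3)
ddr3_1600_cl9 = word_latencies_ns(1600, 9)
print(pc100_cl2, ddr400_cl3, ddr3_1600_cl9)
```

Comparing the PC100 and DDR3-1600 results shows the pattern discussed in the text: the first-word latency improves only modestly between generations, while the time to stream the following words shrinks much more.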
Conclusion
In the early days of the PC, the gap between main memory size and cache size was nowhere near what it is today. Low-level cache sizes have tended to reach a maximum, while main memory keeps getting larger. We can loosely track this by comparing operating systems' main memory requirements over the years with the cache of a typical PC processor of the same period. Because operating systems compete with one another, developers want to pack in as much capability as possible, so an OS will usually require about as much memory as the average machine of its time provides. For instance, Windows 1.0 required 256 KB of RAM[12]. Compare that with the 8 KB of cache in the Intel 80486 of 1989, four years after Windows 1.0 was released: the cache is 3.1% (2^-5) of the RAM. Windows 95 recommended 8 MB of RAM[13]. Compared with the 16 KB of L1 in the Pentium Pro, a ubiquitous processor at the time, the ratio is 0.2% (2^-9). Windows XP came out in 2001 with a recommendation of 128 MB of RAM[14]. Compared with the 32 KB L1 of the Itanium 2, which came out in 2002, the ratio is 0.024% (2^-12). L1 caches have not changed much since then, but RAM is now measured in gigabytes. This growing gap between main memory and low-level cache can be seen as a reason for the increase in associativity: as the cache shrinks relative to main memory, more memory blocks compete for each cache line, and conflict misses in a direct-mapped cache increase dramatically.
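The cache-to-RAM ratios quoted above can be checked with a few lines of arithmetic. The snippet below uses only the sizes given in the paragraph; the variable and function names are chosen here for illustration.

```python
import math

KB = 1024
MB = 1024 * KB

# (label, L1 cache size in bytes, RAM size in bytes), as quoted in the text.
comparisons = [
    ("Windows 1.0 vs. Intel 80486", 8 * KB, 256 * KB),
    ("Windows 95 vs. Pentium Pro", 16 * KB, 8 * MB),
    ("Windows XP vs. Itanium 2", 32 * KB, 128 * MB),
]


def cache_to_ram(cache_bytes, ram_bytes):
    """Return the cache/RAM ratio as (percent, log2 of the ratio)."""
    ratio = cache_bytes / ram_bytes
    return 100 * ratio, math.log2(ratio)


for label, cache_bytes, ram_bytes in comparisons:
    percent, exponent = cache_to_ram(cache_bytes, ram_bytes)
    print(f"{label}: {percent:.3f}% (2^{exponent:.0f})")
```

Each step in the sequence shrinks the ratio by a factor of 8 to 16, even though the cache itself grows from 8 KB to 32 KB.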
On the other hand, as pointed out earlier, associativity decreased slightly in the mid-1990s before increasing again. As noted in the previous paragraph, the shrinking size of L1 relative to main memory was most likely the main cause of the later increase in associativity. The dip, however, occurred around the time that cache size growth stalled and in some cases reversed, so the two may be correlated. It is possible that, as processor speeds increased, engineers could not make caches both larger and faster at the same pace, so a cache had to be either larger or faster. To keep up with the rapid speed increases of the time, they may have had to sacrifice associativity, since searching the lines of a set can slow a cache down. Once these technological hurdles were overcome, size and associativity could increase again.
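The conflict-miss argument can be demonstrated with a toy simulation. The sketch below counts misses for an access pattern that alternates between two addresses exactly one cache size apart, first in a direct-mapped cache and then in a 2-way set-associative cache of the same capacity. It is a simplified model built on assumptions made here: LRU replacement, a 32-byte line, an 8 KB cache, and the hypothetical function name `count_misses`. It does not model the access-time cost of associativity that the paragraph above mentions.

```python
from collections import OrderedDict


def count_misses(addresses, size_bytes, ways, line_bytes=32):
    """Count misses in a set-associative cache with LRU replacement.

    ways == 1 gives a direct-mapped cache.
    """
    num_sets = size_bytes // (ways * line_bytes)
    # One OrderedDict per set; key order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // line_bytes
        index = block % num_sets
        tag = block // num_sets
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return misses


cache_size = 8 * 1024
# Two addresses one cache size apart map to the same set in both organizations.
trace = [0, cache_size] * 100

direct_misses = count_misses(trace, cache_size, ways=1)
two_way_misses = count_misses(trace, cache_size, ways=2)
print(direct_misses, two_way_misses)
```

In the direct-mapped case the two addresses keep evicting each other, so every access misses; with two ways both lines fit in the same set and only the first access to each address misses.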
References
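1. Itanium Specs (p. 20), Intel datasheet. http://download.intel.com/design/itanium2/manuals/25111003.pdf
2. Cache Evolution. http://www.chips.5u.com/idxhst.html
3. Intel Processors, Wikipedia. http://en.wikipedia.org/wiki/List_of_Intel_microprocessors
4. First on-board L1, Wikipedia. http://en.wikipedia.org/wiki/Intel_80486
5. Cache Trend Table. http://faculty.washington.edu/lcrum/Archives/TCSS372AS07/Slides04_05.ppt
6. Sector Caches. http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1034.pdf
7. DDR2/3 Speeds. http://www.tomshardware.com/reviews/ram-speed-tests,1807-3.html
8. Memory Wall, Wulf & McKee. http://www.eecs.ucf.edu/~lboloni/Teaching/EEL5708_2006/slides/wulf94.pdf
9. Pentium Pro Specs, Gateway datasheet. http://support.gateway.com/s/Servers/shared/pproprsr/pentpro.shtml
10. Clock Rate vs. IPC, Agarwal et al. http://www.cs.utexas.edu/users/cart/trips/publications/isca00.pdf
11. Mem Latency-Tol. Methods, Wang et al. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=19C67C93D13D430FE9ECD17FD57D2142?doi=10.1.1.134.6195&rep=rep1&type=pdf
12. Win1.0, Wikipedia. http://en.wikipedia.org/wiki/Windows_1.0
13. Win95, Microsoft. http://support.microsoft.com/kb/138349
14. WinXP, Microsoft. http://support.microsoft.com/kb/314865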