CSC/ECE 506 Spring 2013/6a cs
Cache Hierarchy
In a simple computer model, processor reads data and instructions from the memory and operates on the data. Operating frequency of CPU increased faster than the speed of memory and memory interconnects. For example, cores in Intel first generation i7 processors run at 3.2 GHz frequency, while the memory only runs at 1.3GHz frequency. Also, multi-core architecture started putting more demand on memory bandwidth. This increases the latency in memory access and CPU will have to be idle for most of the time. Due to this, memory became a bottle neck in performance.
To solve this problem, “cache” was invented. Cache is simply a temporary volatile storage space like primary memory but runs at the speed similar to core frequency. CPU can access data and instructions from cache in few clock cycles while accessing data from main memory can take more than 50 cycles. In early days of computing, cache was implemented as a stand alone chip outside the processor. In today’s processors, cache is implemented on same die as core.
There can be multiple levels of caches, each cache subsequently away from the core and larger in size. L1 is closest to the CPU and as a result, fastest to excess. Next to L1 is L2 cache and then L3. L1 cache is divided into instruction cache and data cache. This is better than having a combined larger cache as instruction cache being read-only is easy to implement while data cache is read-write.
Cache management is impacted by three characteristics of modern processor architectures: multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.
Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds that are advancing at a more rapid pace. Two-level and three-level cache hierarchies are common. L1 typically ranges from 16-64KB and provide access in 2-4 cycles. L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles. L3 typically ranges from 4MB to 32MB and provides access to the data it contains within 50-30 cycles.[10]
Multi-core and multprocessor designs place additional constraints on cache management by requiring them to consider all processes running concurrently in the cache coherency, replenishment, fetching, and other management functions.
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors and others private to each processor. Typically, lower-level caches are private and high-level caches may or may not be shared. Notice that none of the examples in the below have a shared L1 cache. Cache management functions must consider both shared and private caches when reading and writing data from memory.
Company & Processor | # cores | L1 cache Per core | L1 cache Per core | L3 cache | Year |
---|---|---|---|---|---|
Intel Pentium Dual Core | 2 | I:32KB D:32KB | 1MB 8 way set assoc. | - | 2006 |
Intel Xeon Clovertown | 2 | I:4*32KB D:4*32KB | 2*4MB | - | Jan 2007 |
Intel Xeon 3200 series | 4 | - | 2*4MB | - | Jan 2007 |
AMD Athlon 64FX | 2 | I:64KB D:4KB Both 2-way set assoc. | 1MB 16way set assoc. | - | Jan 2007 |
AMD Athlon 64X2 | 2 | I:64KB D:4KB Both 2-way set assoc. | 512KB/1MB 16way set assoc. | 2MB | Jan 2007 |
AMD Barcelona | 4 | I:64KB D:64KB | 512KB | 2MB Shared | Aug 2007 |
Sun Microsystems Ultra Sparc T2 | 8 | I:16KB D:8KB | 4MB (8 banks) 16way set assoc. | - | Oct 2007 |
Intel Xeon Wolfdale DP | 2 | D:96KB | 6MB | - | Nov 2007 |
Intel Xeon Hapertown | 4 | D:96KB | 2*6MB | - | Nov 2007 |
AMD Phenom | 4 | I:64KB D:64KB | 512KB | 2MB Shared | Nov 2007 Mar 2008 |
Intel Core 2 Duo | 2 | I:32KB D:32KB | 2/4MB 8 way set assoc. | - | 2008 |
Intel Penryn Wolfdale DP | 4 | - | 6-12MB | 6MB | Mar 2008 Aug 2008 |
Intel Core 2 Quad Yorkfield | 4 | D:96KB | 12MB | - | Mar 2008 |
AMD Toliman | 3K10 | I:64KB D:64KB | 512KB | 2MB Shared | Apr 2008 |
Azul Systems Vega3 7300 Series | 864 | 768GB | - | - | May 2008 |
IBM RoadRunner | 8+1 | 32KB | 512KB | - | Jun 2008 |
Cache Write Policies
In section 6.2.3[10], cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory. As review, write-through writes data to the cache and memory on a write. Write-back writes to cache first and to memory only when a flush is required. The write miss policies covered in the text[10], write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line. Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.
Write hit policies
- Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.
- Advantages:
- Easy to implement
- Main memory has most recent copy of the data
- Read misses never result in writes to main memory
- Disadvantages:
- Every write needs to access main memory
- Bandwidth intensive
- Writes are slower
- Advantages:
- Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.
- Advantages:
- Writes are as fast as the speed of the cache memory
- Multiple writes to a block require one write to main memory
- Less bandwidth intensive
- Disadvantages:
- Harder to implement
- Main memory may not be consistent with cache
- Reads that result in data replacement may cause dirt blocks to be written to main memory
- Advantages:
Write miss policies
Write miss policies can help eliminate write misses and therefore reduce bus traffic. This reduces the wait-time of all processors sharing the bus.
- Write-allocate vs Write no-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block. For write-allocate, there will be a line in the cache for the written data. This policy is typically associated with write-back caches. For no-write-allocate, there will not be a line in the cache.
- Fetch-on-write vs no-fetch-on-write: The fetch-on-write will cause the block of data to be fetched from a lower memory hierarchy if the write misses. The policy fetches a block on every write miss.
- Write-before-hit vs no-write-before-hit: The write-before-hit will write data to the cache before checking the cache tags for a match. In case of a miss, the policy will displace the block of data already in the cache.
Combination Policies
- Write-validate: It is a combination of no-fetch-on-write and write-allocate[4]. The policy allows partial lines to be written to the cache on a miss. It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run. Write-validate requires that the lower level system memory can support writes of partial lines.
- Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate[4]. This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around.
Of these three combinations, write validate performed the worse. The author notes, though, that it does perform better than fetch-on-write and is easy to implement. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.
- Write-around: Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit[4]. This policy uses a non-blocking write scheme to write to cache. It writes data to the next lower cache without modifying the data of the cache line.
Prefetching
Cache is very efficient in terms on access time once that data or instructions are in the cache. But when a process tries to access something that is not already in the cache, a cache miss occurs and those pages need to be brought into the cache from memory. Generally cache miss are expensive to the performance as the processor has to wait for that data (In parallel processing, a process can execute other tasks while it is waiting on data, but there will be some overhead for this). Prefetching is a technique in which data is brought into cache before the program needs it. In other words, it is a way to reduce cache misses. Prefetching uses some type of prediction to mechanism to anticipate the next data that will be needed and brings them into cache. It is not guaranteed that the perfected data will be used. Goal here is to reduce cache misses to improve overall performance.
Some architectures have instructions to prefetch data into cache. Programmers and compliers can insert this prefect instruction in the code. This is known as software prefetching. In hardware prefetching, processor observers the system behavior and issues requests for prefetching. Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch. Graphics Processing Units benefit from prefetching due to spatial locality property of the data[8].
Advantages
- Improves overall performance by reducing cache misses.
Disadvantages
- Wastes bandwidth when prefetched data is not used.
- Hardware prefetching requires complex architecture. Second order effect is cost of implementation on silicon and validation costs.
- Software prefetching adds additional instructions to the program, making the program larger.
- If same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted. If the evicted blocks are needed, then that will generate a cache miss. This can be prevented by having a separate cache for prefetching but it comes with hardware costs.
- When scheduler changes the task running on a processor, prefetched data may become useless.
Effectiveness
Prefetching effectiveness can be tracked by following matrices
- Coverage is defined as fraction of original cache misses that were prefetched resulting in cache hit.
- Accuracy is defined as fraction of prefetches that are useful.
- Timeliness measures how early the prefetches arrive.
Ideally, a system should have high coverage, high accuracy and optimum timeliness. Realistically, aggressive prefetches can increase coverage but decrease accuracy and vice versa. Also, if prefetching is done too early, the fetched data may have to be evicted without being used and if done too late, it can cause cache miss.
Stream Buffer Prefetching
This is a technique for prefetching which uses a FIFO buffer in which each entry is a cacheline and has address (or tag) and a available bit. System prefetches a stream of sequential data into a stream buffer and multiple stream buffers can be used to prefetch multiple streams in parallel. On a cache access, head entries of stream buffers are check for match along with cache check. If the block is not found in cache but found at the head of stream buffer, it is moved to cache and next entry in the buffer becomes the head. If the block is not found in cache or as head of buffer, data is brought from memory into cache and the subsequent address as assigned to a new stream buffer. Only the heads of the stream buffers are checked during cache access and not the whole buffer. Checking all the entries in all the buffers will increase hardware complexity.
Above plot shows the cache hit improvements for with respect to number of stream buffers on different programs. Graph on left compares 8 programs from NAS suite while graph on right shows programs from Unix unity suite[7].
Prefetching in Parallel Computing
On a uniprocessor system, prefetching is definitely helpful to improve performance. On a multiprocessor system prefetching is useful, but there are tighter constrains in implementing because of the fact that data will be shared between different processors or cores. In message passing parallel model, each parallel thread has its own memory space and prefetching can be implemented in same way as for uniprocessor system.
In shared memory parallel programming, multiple threads that run on different processors share common memory space. If multiple processors use common cache, then prefetching implementation is similar to uniprocessor system. Difficulties arise when each core has its own cache. Some of the case-scenarios that can occur are:
- Processor P1 has prefetched some data D1 into its stream buffer but is not used it. At the same time processor P2 reads data D1 into its cache and modifies it and one of the cache coherency protocols would be used to inform processor P1 about this change. D1 is not in P1’s cache so it many simply ignore this. Now, when P1 ties to read D1, it will get the stale data from its stream buffer. One way to prevent this is by improving stream buffers so that they can modify their data just like a cache. This adds complexity to the architecture and increases cost[6].
- Prefetching mechanism can fetch data D1, D2, D3,….D10 into P1’s buffer. Now due to parallel processing, P1 only needs to operate on D1 to D5 while P2 will operate on remaining data. Some bandwidth was wasted in fetching D6 to D10 into P1 even though it did not use it. There is a trade off to be made here, if prefetching is very conservative then it will lead to miss and if not then it will waste bandwidth.
- In a multiprocessor system, if threads are not bound to a core, Operation system can rebalance the treads of different cores. This will require the prefetched buffers to be trashed.
Following are two examples of processors that support prefetching directly in the chip design:
Intel Core i7[11]
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently. The processor assumes that this ascending access will continue, and prefetches the next line.
The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory: one for the upstreams that has 12 entries, and one for downstreams that has 4 entires. As pages are accessed, their addresses are tracked in these arrays. When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.
AMD[12]
The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache. (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though. When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory. For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected. Subsequent detection of misses of sequential blocks may only prefetch a single block. This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.
AMD contends that this is beneficial when processing large data sets which often process sequential data and process all the data in the stream. In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU. AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing.
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream. In this case, the unit-stride is increased to assure that the prefetcher is fetching at a rate sufficient enough to continuously provide data to the processor.
References
1. http://www.real-knowledge.com/memory.htm
2. Computer Design & Technology- Lectures slides by Prof.Eric Rotenberg
3. Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin
4. “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.
5. Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer
6. “Parallel computer architecture: a hardware/software approach” by David. E. Culler, Jaswinder Pal Singh, Anoop Gupta (pg 887)
7. “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler. (ref#2)
8. http://en.wikipedia.org/wiki/Instruction_prefetch
9. http://www.real-knowledge.com/memory.htm
10. Yan Solihin, Fundamentals of Parallel Computer Architecture (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151
11. "Intel® 64 and IA-32 Architectures Optimization Reference Manual", Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf
12. "Software Optimization Guide for AMD Family 10h and 12h Processors", Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf