CSC/ECE 506 Spring 2011/ch6a jp
Cache Hierarchy
In a simple computer model, processor reads data and instructions from the memory and operates on the data. Operating frequency of CPU increased faster than the speed of memory and memory interconnects. For example, cores in Intel first generation i7 processors run at 3.2 GHz frequency, while the memory only runs at 1.3GHz frequency. Also, multi-core architecture started putting more demand on memory bandwidth. This increases the latency in memory access and CPU will have to be idle for most of the time. Due to this, memory became a bottle neck in performance.
To solve this problem, “cache” was invented. Cache is simply a temporary volatile storage space like primary memory but runs at the speed similar to core frequency. CPU can access data and instructions from cache in few clock cycles while accessing data from main memory can take more than 50 cycles. In early days of computing, cache was implemented as a stand alone chip outside the processor. In today’s processors, cache is implemented on same die as core.
There can be multiple levels of caches, each cache subsequently away from the core and larger in size. L1 is closest to the CPU and as a result, fastest to excess. Next to L1 is L2 cache and then L3. L1 cache is divided into instruction cache and data cache. This is better than having a combined larger cache as instruction cache being read-only is easy to implement while data cache is read-write.
Cache Write Policies
Write hit policies
Write-through
Also known as store-through, this policy will write to main memory whenever a write is performed to cache.
Write-back
Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.
Write miss policies
Write-allocate vs Write no-allocate
When a write misses in the cache, there may or may not be a line in the cache allocated to the block. For write-allocate, there will be a line in the cache for the written data. This policy is typically associated with write-back caches. For no-write-allocate, there will not be a line in the cache.
Fetch-on-write vs no-fetch-on-write
The fetch-on-write will cause the block of data to be fetched from a lower memory hierarchy if the write misses. The policy fetches a block on every write miss.
Write-before-hit vs no-write-before-hit
The write-before-hit will write data to the cache before checking the cache tags for a match. In case of a miss, the policy will displace the block of data already in the cache.
Combination Policies
Write-validate
It is a combination of no-fetch-on-write and write-allocate[4]. The policy allows partial lines to be written to the cache on a miss. It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run.
Write-invalidate
This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate. This policy invalidates lines when there is a miss.
Write-around
Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit. This policy uses a non-blocking write scheme to write to cache. It writes data to the next lower cache without modifying the data of the cache line.
Prefetching
Cache is very efficient in terms on access time once that data or instructions are in the cache. But when a process tries to access something that is not already in the cache, a cache miss occurs and those pages need to be brought into the cache from memory. Generally cache miss are expensive to the performance as the processor has to wait for that data (In parallel processing, a process can execute other tasks while it is waiting on data, but there will be some overhead for this). Prefetching is a technique in which data is brought into cache before the program needs it. In other words, it is a way to reduce cache misses. Prefetching uses some type of prediction to mechanism to anticipate the next data that will be needed and brings them into cache. It is not guaranteed that the perfected data will be used. Goal here is to reduce cache misses to improve overall performance.
Some architectures have instructions to prefetch data into cache. Programmers and compliers can insert this prefect instruction in the code. This is known as software prefetching. In hardware prefetching, processor observers the system behavior and issues requests for prefetching. Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch. Graphics Processing Units benefit from prefetching due to spatial locality property of the data.
Advantages
Improves overall performance by reducing cache misses.
Disadvantages
- Wastes bandwidth when prefetched data is not used.
- Hardware prefetching requires complex architecture. Second order effect is cost of implementation on silicon and validation costs.
- Software prefetching adds additional instructions to the program, making the program larger.
- If same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted. If the evicted blocks are needed, then that will generate a cache miss. This can be prevented by having a separate cache for prefetching but it comes with hardware costs.
- When scheduler changes the task running on a processor, prefetched data may become useless.
Effectiveness
Stream Buffer Prefetching
Prefetching in Parallel Computing
References
1. http://www.real-knowledge.com/memory.htm
2. Computer Design & Technology- Lectures slides by Prof.Eric Rotenberg
3. Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin
4. “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.
5. Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer
Header 1 | Header 2 | Header 3 |
---|---|---|
row 1, cell 1 | row 1, cell 2 | row 1, cell 3 |
row 2, cell 1 | row 2, cell 2 | row 2, cell 3 |
row 3, cell 1 | row 3, cell 2 | row 3, cell 3 |
× | 1 | 2 | 3 |
---|---|---|---|
1 | 1 | 2 | 3 |
2 | 2 | 4 | 6 |
3 | 3 | 6 | 9 |