CSC/ECE 506 Spring 2013/10a os

From Expertiza_Wiki
Jump to navigation Jump to search

Prefetching and Memory Consistency Models


Cache Misses

Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. Conflict misses are misses that would not occur if the cache was fully-associative and had LRU replacement. Compulsory misses are misses required in any cache organization because they are the first references to an instruction or piece of data. Capacity misses occur when the cache size is not sufficient to hold data between references. Coherence misses are misses that occur as a result of invalidation to preserve multiprocessor cache consistency.The number of compulsory and capacity misses can be reduced by employing prefetching techniques such as by increasing the size of the cache lines or by prefetching blocks ahead of time.

Prefetching

Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss.

Prefetching can be achieved in two ways:

  • Software Prefetching
  • Hardware Prefetching

Software and Hardware Prefetching<ref>http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html</ref>

With software prefetching the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive.A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch.If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.


Many modern processors implement hardware prefetching. This means that the processor monitors the memory access pattern of the running program and tries to predict what data the program will access next and prefetches that data. There are few different variants of how this can be done.

A stream prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.

A stride prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.

An adjacent cache line prefetcher automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.

Software vs. Hardware Prefetching<ref>http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture</ref>

Software prefetching has the following characteristics:

1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.

2. Can use less bus bandwidth than hardware prefetching.

3. Software prefetches must be added to new code, and they do not benefit existing applications.


The characteristics of the hardware prefetching are as follows :

1. Works with existing applications

2. Requires regular access patterns

3. Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.

4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).

Hardware-based Prefetching Techniques<ref>http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html</ref><ref>http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html</ref>

Based on the previous memory access, the hardware based techniques can be employed to dynamically predict the memory address to prefetch. These techniques can be classified into two main classes:

  • On-chip Schemes: Based on the addresses required by the processor in all data references.
  • Off-chip Schemes: Based on the addresses that result in L1 cahce misses

Some of the most commonly used Prefetching techniques are discussed below. The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types:

(i) always prefetch - prefetch the next block on each reference (ii) prefetch on miss - prefetch the next block only on a miss (iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time.

P-Block Lookahead extends the idea of OBL by prefetching blocks instead of one, where is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm which dynamically adapts the degree of prefetch for the workload.

A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.

Stride-based prefetching has also been studied mainly for processor caches where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be consider as a better choice because most strides lie within the block size and it can also exploit locality. History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.

Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.

Disadvantages of Prefetching

1. Increased Complexity and overhead of handling the perfetching algorithms. Performance must be improved significantly to counterbalance this overhead and complexity or the efforts are not worthwhile.

2. With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory to not only deal with regular prefetch requests but also to handle prefetch from different sources, thus greatly increasing the overhead and complexity of logic.

3. If prefetched data is stored in the data cache, then cache conflict or cache pollution, can become a significant burden.

References

<references/>