CSC/ECE 506 Spring 2012/10a jp
Prefetching and consistency models
Buffering and pipelining are attractive techniques for hiding the latency of memory accesses in large scale shared-memory multiprocessors. However, the unconstrained use of these techniques can result in an intractable programming model for the machine. Consistency models provide more tractable programming models by introducing various restrictions on the amount of buffering and pipelining allowed. Several memory consistency models have been proposed in the literature. The strictest model is sequential consistency (which requires the execution of a parallel program to appear as some interleaving of the execution of the parallel processes on a sequential machine. Sequential consistency imposes severe restrictions on buffering and pipelining of memory accesses.
Prefetching
The delay constraints imposed by a consistency model limit the amount of buffering and pipelining among memory accesses. Prefetching provides one method for increasing performance by partially proceeding with an access that is delayed due to consistency model constraints. Prefetches can be initiated either by an explicit fetch operation within a program, by logic that monitors the processor’s referencing pattern to infer prefetching, or by a combination of these approaches. However they are initiated, prefetches must be issued in a timely manner. If a prefetch is issued too early there is a chance that the prefetched data will displace other useful data from the higher levels of the memory hierarchy or be displaced itself before use. If the prefetch is issued too late, it may not arrive before the actual memory reference and thereby introduce processor stall cycles. Software prefetching issues fetches only for data that is likely to be used while hardware schemes tend data in a more speculative manner. The decision of where to place prefetched data in the memory hierarchy is a fundamental design decision. Clearly, data must be moved into a higher level of the memory hierarchy to provide a performance benefit. The majority of schemes place prefetched data in some type of cache memory. Other schemes place prefetched data in dedicated buffers to protect the data from premature cache evictions and prevent cache pollution. When prefetched data are placed into named locations, such as processor registers or memory, the prefetch is said to be binding and additional constraints must be imposed on the use of the data. Finally, multiprocessor systems can introduce additional levels into the memory hierarchy which must be taken into consideration. Data can be prefetched in units of single words, cache blocks, contiguous blocks of memory or program data objects. Often, the amount of data fetched is determined by the organization of the underlying cache and memory system. Cache blocks may be the most appropriate size for uniprocessors and Symmetric Multiprocessors (SMPs) while larger memory blocks may be used to amortize the cost of initiating a data transfer across an interconnection network of a large, distributed memory multiprocessor.
Prefetching Implementations
Performance Advantages
Performance Disadvantages
References
Parallel computer architecture: a hardware/software approach. David E. Culler, Jaswinder Pal Singh, Anoop Gupta. http://www.nikolaylaptev.com/master/classes/cs254.pdf http://www.multicoreinfo.com/prefetching-multicore-processors/ http://dl.acm.org/citation.cfm?id=1168892 http://www.cs.uwaterloo.ca/~brecht/courses/856/Possible-Readings/prefetching-to-cache/software-prefetching-p40-callahan.pdf http://www.jilp.org/vol6/v6paper7.pdf