CSC/ECE 506 Spring 2012/10a jp: Difference between revisions
No edit summary |
No edit summary |
||
Line 31: | Line 31: | ||
'''Simulated Processing Node Architecture''' | '''Simulated Processing Node Architecture''' | ||
[[Image:P1.PNG|thumbnail|right|x200px|Figure 1:Interfacing in Simulated Processing Node Architecture <sup><span id="3body"></span></sup>]] | [[Image:P1.PNG|thumbnail|right|x200px|Figure 1:Interfacing in Simulated Processing Node Architecture <sup><span id="3body"></span></sup>]] | ||
For both prefetching techniques, we only prefetch into the SLC. In addition, SLWB keeps track of the outstanding requests ( SLC read miss, prefetch and write requests). The SLC controller deals with all the complexities of the cache coherence protocol and the prefetching mechanism.The SLC snoops on the local bus for consistency actions: if an invalidation to a block residing in the SLC is detected, the copy of teh block is invalidated in both the SLC and the FLC. Requests from the local bus have priority over those from the FLWB. These interferences with the processor accesses are tolerable because most accesses hit in the FLC. | In contrast to software-controlled prefetching, the processors do not need to support specific prefetch instructions.The environment to which these processors are interfaced is shown in figure below.According to the figure, the processing node consists of a processor, a first-level cache (FLC), a second-level cache (SLC), a first- and second-level write buffer (FLWB and SLWB), a local bus, a network interface controller, and a memory module. The FLC is a direct-mapped write-through cache with no allocation of blocks on wirte misses and is blocking on read misses. The SLC is a direct-mapped write-back cache.For both prefetching techniques, we only prefetch into the SLC. In addition, SLWB keeps track of the outstanding requests ( SLC read miss, prefetch and write requests). The SLC controller deals with all the complexities of the cache coherence protocol and the prefetching mechanism.The SLC snoops on the local bus for consistency actions: if an invalidation to a block residing in the SLC is detected, the copy of teh block is invalidated in both the SLC and the FLC. Requests from the local bus have priority over those from the FLWB. These interferences with the processor accesses are tolerable because most accesses hit in the FLC. | ||
=='''Performance Advantages'''== | =='''Performance Advantages'''== |
Revision as of 04:04, 4 April 2012
Prefetching and consistency models
Introduction
Buffering and pipelining are attractive techniques for hiding the latency of memory accesses in large scale shared-memory multiprocessors. However, the unconstrained use of these techniques can result in an intractable programming model for the machine. Consistency models provide more tractable programming models by introducing various restrictions on the amount of buffering and pipelining allowed. Several memory consistency models have been proposed in the literature. The strictest model is sequential consistency (which requires the execution of a parallel program to appear as some interleaving of the execution of the parallel processes on a sequential machine. Sequential consistency imposes severe restrictions on buffering and pipelining of memory accesses.
Prefetching
The delay constraints imposed by a consistency model limit the amount of buffering and pipelining among memory accesses. Prefetching provides one method for increasing performance by partially proceeding with an access that is delayed due to consistency model constraints. Prefetches can be initiated either by an explicit fetch operation within a program, by logic that monitors the processor’s referencing pattern to infer prefetching, or by a combination of these approaches. However they are initiated, prefetches must be issued in a timely manner. If a prefetch is issued too early there is a chance that the prefetched data will displace other useful data from the higher levels of the memory hierarchy or be displaced itself before use. If the prefetch is issued too late, it may not arrive before the actual memory reference and thereby introduce processor stall cycles. Software prefetching issues fetches only for data that is likely to be used while hardware schemes tend data in a more speculative manner.
The decision of where to place prefetched data in the memory hierarchy is a fundamental design decision. Clearly, data must be moved into a higher level of the memory hierarchy to provide a performance benefit. The majority of schemes place prefetched data in some type of cache memory. Other schemes place prefetched data in dedicated buffers to protect the data from premature cache evictions and prevent cache pollution. When prefetched data are placed into named locations, such as processor registers or memory, the prefetch is said to be binding and additional constraints must be imposed on the use of the data. Finally, multiprocessor systems can introduce additional levels into the memory hierarchy which must be taken into consideration.Data can be prefetched in units of single words, cache blocks, contiguous blocks of memory or program data objects. Often, the amount of data fetched is determined by the organization of the underlying cache and memory system. Cache blocks may be the most appropriate size for uniprocessors and Symmetric Multiprocessors (SMPs) while larger memory blocks may be used to amortize the cost of initiating a data transfer across an interconnection network of a large, distributed memory multiprocessor.
Description
Prefetching can be classified based on whether it is binding or non-binding, and whether it is controlled by hardware or software. With a binding prefetch, the value of a later reference (e.g., a register load) is bound at the time the prefetch completes. This places restrictions on when a binding prefetch canbe issued, since the value will become stale if another processor modifies the location during the interval between prefetch and reference. Hardware cache-coherent architectures, such as the Stanford DASH multiprocessor, can provide prefetching that is non-binding. With a non-binding prefetch, the data is brought close to the processor (e.g., into the cache) and is kept coherent until the processor actually reads the value. Thus, non-binding prefetching does not affect correctness for any of the consistency models and can be used as simply a performance boosting technique. The technique described in this section assumes hardware-controlled non-binding prefetch.
Prefetching can enhance performance by partially servicing large latency accesses that are delayed due to consistency model constraints. For a read operation, a read prefetch can be used to bring the data into the cache in a read-shared state while the operation is delayed due to consistency constraints. Since the prefetch is non-binding, we are guaranteed that the read operation will return a correct value once it is allowed to perform, regardless of when the prefetch completed. In the majority of cases, we expect the result returned by the prefetch to be the correct result. The only time the result may be different is if the location is written to between the time the prefetch returns the value and the time the read is allowed to perform. In this case, the prefetched location would either get invalidated or updated, depending on the coherence scheme. If invalidated, the read operation will miss in the cache and access the new value from the memory system, as if the prefetch never occurred. In the case of an update protocol, the location is kept up-to-date, thus providing the new value to the read operation.
For a write operation, a read-exclusive prefetch can be used to acquire exclusive ownership of the line, enabling the write to that location to complete quickly once it is allowed to perform. A read-exclusive prefetch is only possible if the coherence scheme is invalidation-based. Similar to the read prefetch case, the line is invalidated if another processor writes to the location between the time the read-exclusive prefetch completes and the actual write operation is allowed to proceed. In addition, exclusive ownership is surrendered if another processor reads the location during that time.
Implementation
This subsection discusses the requirements that the prefetch technique imposes on a multiprocessor architecture. Let us first consider how the proposed prefetch technique can be incorporated into the processor environment. Assume the general case where the processor has a load and a store buffer. The usual way to enforce a consistency model is to delay the issue of accesses in the buffer until certain previous accesses complete. Prefetching can be incorporated in this framework by having the hardware automatically issue a prefetch (read prefetch for reads and read-exclusive prefetch for writes and atomic read-modify writes for accesses that are in the load or store buffer, but are delayed due to consistency constraints. A prefetch buffer may be used to buffer multiple prefetch requests. Prefetches can be retired from this buffer as fast as the cache and memory system allow.
A prefetch request first checks the cache to see whether the line is already present. If so, the prefetch is discarded. Otherwise the prefetch is issued to the memory system. When the prefetch response returns to the processor, it is placed in the cache. If a processor references a location it has prefetched before the result has returned, the reference request is combined with the prefetch request so that a duplicate request is not sent out and the reference completes as soon as the prefetch result returns. The prefetch technique discussed imposes several requirements on the memory system. Most importantly, the architecture requires hardware coherent caches. In addition, the location to be prefetched needs to be cachable. Also, to be effective for writes, prefetching requires an invalidation-based coherence scheme. In update-based schemes, it is difficult to partially service a write operation without making the new value available to other processors, which results in the write being performed.The strengths and weaknesses of hardware controlled non-binding prefetching are discussed in the next subsection.
Hardware Implementation
Simulated Processing Node Architecture
In contrast to software-controlled prefetching, the processors do not need to support specific prefetch instructions.The environment to which these processors are interfaced is shown in figure below.According to the figure, the processing node consists of a processor, a first-level cache (FLC), a second-level cache (SLC), a first- and second-level write buffer (FLWB and SLWB), a local bus, a network interface controller, and a memory module. The FLC is a direct-mapped write-through cache with no allocation of blocks on wirte misses and is blocking on read misses. The SLC is a direct-mapped write-back cache.For both prefetching techniques, we only prefetch into the SLC. In addition, SLWB keeps track of the outstanding requests ( SLC read miss, prefetch and write requests). The SLC controller deals with all the complexities of the cache coherence protocol and the prefetching mechanism.The SLC snoops on the local bus for consistency actions: if an invalidation to a block residing in the SLC is detected, the copy of teh block is invalidated in both the SLC and the FLC. Requests from the local bus have priority over those from the FLWB. These interferences with the processor accesses are tolerable because most accesses hit in the FLC.
Performance Advantages
Performance Disadvantages
References
Parallel computer architecture: a hardware/software approach. David E. Culler, Jaswinder Pal Singh, Anoop Gupta. http://www.nikolaylaptev.com/master/classes/cs254.pdf http://www.multicoreinfo.com/prefetching-multicore-processors/ http://dl.acm.org/citation.cfm?id=1168892 http://www.cs.uwaterloo.ca/~brecht/courses/856/Possible-Readings/prefetching-to-cache/software-prefetching-p40-callahan.pdf http://www.jilp.org/vol6/v6paper7.pdf