CSC/ECE 506 Spring 2012/10a vm
Landing page for "Prefetching and consistency models."
Prefetching
Introduction to prefetching
Almost all processors today, use prefetching as a means to speed up execution. Primarily prefetching is used to shorten the amount of time a processor is in wait state by predicting which cache block would be accessed, so that where there is a cache miss the old block in the cache can be replaced with the prefetched block immediately and thus decreasing the idle time of the processor. Performance in prefetching is best when it is done by following the program order. However it need not always be that prefetching is done in program order, a processor trying to guess the result of a calculation during a complex branch prediction algorithm will need to anticipate the result and fetch the right set of instructions for execution. Things get more complex when it comes to graphical processing units or GPUs. Prefetching can take advantage of spatial coherence and the data that is prefetched are not a set of instructions but instead they are texture elements that can be mapped to a polygon.<ref>Instruction prefetch wiki article.</ref>
Types of prefetching
Prefetching can be primarily classified based on whether it is binding or non-binding and whether it is hardware or software controlled. Using a binding prefetch, the value of a later reference, like a register load, is bound at the time the prefetch completes. However this comes with restrictions because the value in the prefetch might not be accurate, there might be invalidation that can be caused by another processor in the time frame between prefetch and reference. Coming to the non-binding prefetch, for the data is got into the processor cache, the coherence is maintained till the processor reads or writes the value. This approach unlike the binding prefetch will not have an effect on the correctness of any consistency model. This can serve as a serious improvement in terms of performance.
The kind of prefetch that is interesting to us is the hardware controlled and non-binding prefetch. Prefetching done is software involves significant effort from the user side and it may not be scalable as the length of the program increases. Whereas non-binding prefetch yields significantly more performance when compared to the binding prefetch. The major performance enhancement from prefetching comes with decreasing the memory latency that are caused by all consistency models.
In the case of a read operation, the read prefetch is issed to get the data in a read-only shared state in the cache. Since we are considering the non-binding prefetch there is a guarantee that the read operation is going to return a correct value when it is allowed to run irrespective of the actual prefetch completion time. However there is also a catch to this, the value read might not always be correct. Consider a case where there is a write operation performed on the memory location that was just read-prefetched. Thus the value can be read again depending on whether an invalidation or an update based coherence scheme is being used. If an update based coherence scheme is being used, we can be assured that the other processor sends a bus update and the current processor can pick it up to update its value. However, if an invalidation based coherence scheme is used, there would be a coherence miss and the value will be read again from the system memory giving an illusion that the prefetch did not occur.
In the case of a write operation, to acquire the exclusive ownership of the line, a read-exclusive prefetch can be used. Since it is cached in the exclusive mode, a write operation can proceed without incurring an invalidate or an update on other caches which have cached the same line or block. The write operation performs quickly because the value is already cached and there is not coherence miss, thus reducing the idle time of the processor. However, it is to be noted that the read-exclusive prefetch is only possible if the coherence scheme is invalidation based and not an update one. The same logic of incorrectness applies here that we have seen in the read case, suppose that another processor asks for a write to the same block, the exclusive ownership will no longer be maintained. Instead the cache block will be invalidated.
How does prefetching work ?
In a multiprocessor environment, each processor has a load and a store buffer. The way in which consistency models work is that the accesses to memory locations are delayed until the previous requests have finished. While issuing a read or a write request, we can implement the prefetch by making the hardware automatically issue a prefetch for requests which are in the load or store buffer and those requests which cannot be honored right away because of the delays incurred due to constraints.
Let us trace through the steps a prefetch request would follow. First it looks up the cache to see whether the block is already present. If it is indeed available, the request is terminated. If it is not present, the prefetch issued to memory. Depending upon what type of prefetch technique we are using, the response to the prefetch might be immediate or not. The response is then placed in the cache. An optimization which can be done is detecting duplicate requests to the same location and combining the responses to each individual request to point to the prefetch request and thus preventing the issue of duplicate prefetch requests to memory. This mechanism also taxes the memory system in the form of requirements. Firstly, it requires hardware coherent caches and the block which is prefetched needs to fit into the cache. Then for the write prefetch to work efficiently, there should be an invalidation based coherence scheme in place. Not only these, the bus bandwidth should be high in order to handle several prefetch requests at the same time. The caches will also be more busy when prefetch is implemented.
The amount of lookahead we do for a prefetch needs to be controlled, in the sense that if more number of blocks are prefetched than required, then the processor might not ever use it and it would also result in cache misses of actually required blocks. Thus the extent to which lookahead is done should dynamically adjust depending on whether the processor actually uses the blocks or just evicts them without a read or a write reference to it.
Prefetching with Sequential Consistency
Sequential Consistency (SC) might cause a reader to assume that its implementation will need a single memory module. This is however not true, both SC and the relaxed memory models allow many optimizations within them in order to deliver high performance. SC and the relaxed models allow non-binding prefetching along with coherent caching.
Let us take a look at an example and see how SC vs prefetching with SC performs
Assumptions being made are that the processor has non-blocking reads with branch prediction machinery, invalidation based cache coherence scheme is used, if there is a cache hit the latency is 1 clock cycle and the cache miss incurs a latency of 100 clock cycles.
The following code segment represents a common producer scenario:
1) lock L (miss) 2) write A (miss) 3) write B (miss) 4) unlock L (hit)
Measuring how the code performs under SC only
Since we are considering a system with SC, we know that an access to memory cannot be issued until the previous access has completed. The first three accesses are misses and they occur one after the other without any interleaving, which gives us a total latency of 100 x 3 = 300 clock cycles. The fourth operation however is a cache hit, thus adding only 1 clock cycle. The four accesses take up 301 clock cycles to run.
Measuring how the same code performs under prefetching with SC
With prefetching we can get significant improvement in terms of performance. Since the system now has the ability to do prefetching and the processor has branch prediction machinery, let us assume that the branch predictor takes the loop in which the lock L has been successfully acquired. The lookahead operates and A and B are prefetched into memory in the read exclusive mode. When the statement lock L executes, the write accesses to A and B are served in parallel as well because they are prefetched. Thus the total would sum to 102 clock cycles (2 hits in parallel and 1 miss). The fourth memory access is a hit and takes only 1 clock cycle to finish thus giving a total of 103 clock cycles.
Comparing the 103 clock cycles in prefetching under SC .vs. SC only we see that there is nearly a 200 clock cycle difference which is a significant improvement. As the number of accesses increase, the difference is only going to get larger until a certain point where "SC only" would no longer be a viable option.
Another example showing how SC vs prefetching with SC performs
The same assumptions are being made which we made for the first example.
The following code segment represents a common producer scenario:
1) lock L (miss) 2) read C (miss) 3) read D (hit) 4) read E[D] (miss) 5) unlock L (hit)
Measuring how the code performs under SC only
Under the critical section we see that there are three read accesses. Read D is assumed to be a hit and read E[D] may be a hit or a miss depending on the value D. Doing the same math that we have done above, under SC only we see that the total number of clock cycles are 302.
Measuring how the same code performs under prefetching with SC
With prefetching, we see that the total time taken will be 203 (2 misses & 3 hits).
Thus again, we see that prefetching with SC performs better than "SC only".
Evaluating the performance of prefetching with SC
We have two examples, where prefetching with SC does a better job than "SC only". But what we need to look at is whether this is the maximum performance which can be extracted. Consider the second example where we had 3 reads in a critical section. The address which E[D] is trying to access will depend on the value of D. Read D was a cache hit, and E[D] is not allowed to perform under SC until the read C finishes. Thus, even though prefetching increases performance by reducing the time delay caused due to consistency constraints, it does not deal with situations where out of order usage of the values returned by memory access is critical for the processor to proceed efficiently. That is prefetching fails to do a good job when out of order consumption of values is important.
Even though the main use of prefetching is to enable the servicing of the requests faster when compared to the delay constraints imposed by the memory contraint models, correctness needs to be maintained at all times. The number of times a prefetched value is incorrect should be small, and luckily this is indeed true in the real world for several reasons. The invalidation of the prefetched values is loosely coupled with the delay to obtain a correct execution sequence. These delays are imposed by the memory consistency models, and in SC which is the stricter of all models imposes many delays. Usually, the time when one process releases a synchronization is much earlier when compared another process trying to acquire it. Thus the delays introduced by SC are unnecessary and therefore taxes the system.
Such cases are not difficult to find in real applications and hence prefetching cannot be used all the time. Therefore prefetching under SC was not widely adopted and hence new schemes had to be developed in order to overcome the drawbacks of this method.
Prefetching under Release Consistency
Prefetching under release consistency does at least as good as prefetching under SC. Let us evaluate its performance in the two examples we looked at previously. We follow the same assumptions about the system which we have followed previously.
The first example
The following code segment represents a common producer scenario:
1) lock L (miss) 2) write A (miss) 3) write B (miss) 4) unlock L (hit)
Measuring how the code performs under RC only
Since we are considering a system with RC, we know that write accesses to memory are delayed until the lock access can be performed. The lock L operation takes 100 clock cycles to finish, followed by another 100+1 cycles for write A, write B to finish and then finally the fourth operation takes 1 clock cycle to finish because it is a hit. Thus the total number of clock cycles are 202 under the "RC only" approach.
Measuring how the same code performs under prefetching with RC
With prefetching we can get significant improvement in terms of performance. Since the system now has the ability to do prefetching and the processor has branch prediction machinery, let us assume that the branch predictor takes the loop in which the lock L has been successfully acquired. The lookahead operates and A and B are prefetched into memory in the read exclusive mode. When the statement lock L executes, the write accesses to A and B are served in parallel as well because they are prefetched. Thus the total would sum to 102 clock cycles (2 hits in parallel and 1 miss). The fourth memory access is a hit and takes only 1 clock cycle to finish thus giving a total of 103 clock cycles.
In this example, we found that prefetching with SC and prefetching with RC take the amount of time.
The second example
The following code segment represents a common producer scenario:
1) lock L (miss) 2) read C (miss) 3) read D (hit) 4) read E[D] (miss) 5) unlock L (hit)
Measuring how the code performs under RC only
Under the critical section we see that there are three read accesses. Read D is assumed to be a hit and read E[D] may be a hit or a miss depending on the value D. Doing the same math that we have done above, under "RC only" we can see that the total number of clock cycles are 203.
Measuring how the same code performs under prefetching with RC
With prefetching, we see that the total time taken will be 202 (2 misses & 2 hits pipelined).
Thus again, we see that prefetching with RC performs better than "RC only".
However, we have not got a significant performance gain in the second example. Thus even though RC allows more relaxation on the order of operations issued to memory, the gain is not much. Relaxed models offer more options on the hardware implementation perspective than SC so we might get tempted to conclude that hardware must use the relaxed models instead of SC. On the other hand, the performance gain we expect too see is not delivered under prefetching with RC, thus it cannot be justified to be a good approach either. Even though it performs better than prefetching under SC.
References
<references />