Expertiza_Wiki - User contributions [en]

CSC/ECE 506 Spring 2013/10a os

2013-04-03T18:50:12Z

Scanjee: /* Prefetching Improvements */

[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]

'''Prefetching and Memory Consistency Models'''

Previous articles

# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]

== '''Cache Misses'''==
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. Conflict misses are misses that would not occur if the cache were fully associative with LRU replacement. Compulsory misses occur in any cache organization because they are the first references to an instruction or piece of data. Capacity misses occur when the cache is too small to hold data between references. Coherence misses result from invalidations issued to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by increasing the size of the cache lines or by prefetching blocks ahead of time.
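The classification above can be made concrete with a small simulation. The following is a minimal sketch, not a reference implementation: it assumes a trace of block addresses, a set-associative LRU cache, and a fully associative LRU cache of the same capacity as the yardstick for separating conflict from capacity misses. Coherence misses are left out because the sketch models a single processor. The function name <code>classify_misses</code> and its parameters are hypothetical.

```python
from collections import OrderedDict

def classify_misses(trace, num_sets, ways):
    """Classify each miss of a set-associative LRU cache (block addresses)."""
    capacity = num_sets * ways
    sets = [OrderedDict() for _ in range(num_sets)]   # the cache being studied
    full = OrderedDict()                              # fully associative LRU, same capacity
    seen = set()                                      # blocks referenced at least once
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in trace:
        # Look the block up in the fully associative reference cache.
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) >= capacity:
                full.popitem(last=False)              # evict least recently used
            full[block] = True

        # Look the block up in the cache being studied.
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)
            counts["hit"] += 1
        else:
            if len(s) >= ways:
                s.popitem(last=False)
            s[block] = True
            if block not in seen:
                counts["compulsory"] += 1             # first reference ever
            elif full_hit:
                counts["conflict"] += 1               # full associativity would have hit
            else:
                counts["capacity"] += 1               # even full associativity misses
        seen.add(block)
    return counts
```

For instance, in a direct-mapped cache with four sets, blocks 0 and 4 map to the same set, so alternating between them produces conflict misses even though the cache is mostly empty.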

== '''Prefetching'''==
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss.

Prefetching can be achieved in two ways:

* Software Prefetching

* Hardware Prefetching

===Software and Hardware Prefetching<ref>http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html</ref>===

With software prefetching the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used.

If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch.

If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.
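The timing trade-off can be illustrated with a little arithmetic. The sketch below assumes a loop in which every iteration takes a fixed number of cycles and every prefetched line takes a fixed miss latency to arrive; both numbers and the function names are illustrative assumptions. It covers only the "too close" side of the trade-off and does not model eviction of lines that were prefetched too early.

```python
import math

def prefetch_distance(miss_latency_cycles, cycles_per_iteration):
    """Iterations ahead to prefetch so the data arrives just before it is used."""
    if cycles_per_iteration <= 0:
        raise ValueError("cycles_per_iteration must be positive")
    return math.ceil(miss_latency_cycles / cycles_per_iteration)

def stall_per_iteration(miss_latency_cycles, cycles_per_iteration, distance):
    """Cycles still spent waiting when prefetching `distance` iterations ahead."""
    covered = distance * cycles_per_iteration   # latency hidden behind useful work
    return max(0, miss_latency_cycles - covered)
```

With a 100-cycle miss latency and 30 cycles of work per iteration, prefetching one iteration ahead still leaves 70 stall cycles, while prefetching four iterations ahead hides the latency completely under these assumptions.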

Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.

1. A stream prefetcher looks for streams in which a sequence of consecutive cache lines is accessed by the program. When such a stream is found, the processor starts prefetching the cache lines ahead of the program's accesses.

2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, which do not necessarily have to be to consecutive cache lines. When such an instruction is detected, the processor tries to prefetch the cache lines it will access next.

3. An [http://www.techarp.com/showfreebog.aspx?lang=0&bogno=282 adjacent cache line prefetcher] automatically fetches the cache lines adjacent to the ones being accessed by the program. This can be used to mimic the behaviour of a larger cache line size in a cache level without actually having to increase the line size.
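As an illustration of the stride variant, the following minimal sketch keeps one table entry per load instruction, keyed by a program counter value, and predicts the next address once the same non-zero stride has been observed twice in a row. The class name, the unbounded table, and the two-in-a-row confirmation rule are assumptions made for the example; real hardware uses small fixed-size tables and its own confidence schemes.

```python
class StridePrefetcher:
    """Toy reference prediction table indexed by the instruction's PC."""

    def __init__(self):
        self.table = {}  # pc -> (last_address, last_stride)

    def access(self, pc, address):
        """Record one access; return the address to prefetch, or None."""
        prediction = None
        if pc in self.table:
            last_address, last_stride = self.table[pc]
            stride = address - last_address
            # Predict only once the same non-zero stride is seen twice in a row.
            if stride != 0 and stride == last_stride:
                prediction = address + stride
            self.table[pc] = (address, stride)
        else:
            self.table[pc] = (address, 0)
        return prediction
```

A load that walks an array with a 256-byte stride triggers a prediction from its third access onward, while a load with irregular addresses never does.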

====Software vs. Hardware Prefetching<ref>http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture</ref>====
Software prefetching has the following characteristics:

* Can handle irregular access patterns, which do not trigger the hardware prefetcher.

* Can use less bus bandwidth than hardware prefetching.

* Software prefetches must be added to new code, and they do not benefit existing applications.

Hardware prefetching has the following characteristics:

* Works with existing applications

* Requires regular access patterns

* Incurs a start-up penalty before the hardware prefetcher triggers, and issues extra fetches after the end of an array is reached.

* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).

===Hardware-based Prefetching Techniques<ref>http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html</ref><ref>http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html</ref>===
Hardware-based techniques use previous memory accesses to dynamically predict the memory addresses to prefetch. These techniques can be classified into two main classes:
* On-chip schemes: based on the addresses required by the processor in all data references.
* Off-chip schemes: based on the addresses that result in L1 cache misses.

Some of the most commonly used Prefetching techniques are discussed below.
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types:

1).''always prefetch'' - prefetch the next block on each reference

2).''prefetch on miss'' - prefetch the next block only on a miss

3).''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time.
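The difference between the three OBL variants shows up clearly on a purely sequential block stream. The sketch below is a simplified model under stated assumptions: the cache is unbounded so that only the prefetch policy matters, prefetched blocks arrive instantly, and "tagged" is interpreted as prefetching on a demand miss or on the first reference to a prefetched block. The function name and policy strings are hypothetical.

```python
def obl_misses(trace, policy):
    """Count demand misses for a block trace under one OBL policy.

    policy is "none", "always", "on_miss", or "tagged". The cache is
    unbounded so that only the prefetch policy affects the result.
    """
    cache = set()        # blocks currently cached
    untouched = set()    # prefetched blocks not yet referenced (the "tag")
    misses = 0
    for block in trace:
        hit = block in cache
        first_use_of_prefetch = block in untouched
        if not hit:
            misses += 1
            cache.add(block)
        untouched.discard(block)

        if policy == "always":
            prefetch = True
        elif policy == "on_miss":
            prefetch = not hit
        elif policy == "tagged":
            prefetch = (not hit) or first_use_of_prefetch
        else:
            prefetch = False

        if prefetch and block + 1 not in cache:
            cache.add(block + 1)
            untouched.add(block + 1)
    return misses
```

On a stream of eight consecutive blocks, prefetch-on-miss halves the misses, whereas always-prefetch and tagged prefetch leave only the initial miss. Tagged prefetch achieves this without issuing a prefetch on every single reference.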

P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.

A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.

Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.

Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads, so detecting them is not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques because the applications typically run on external hardware.

=='''Prefetching Improvements'''==

Fortunately, a number of techniques have evolved to address many of the concerns about prefetching, making it significantly more effective.

==='''Cache Sizing'''===

Increasing the dimensions of the cache greatly improves the performance of prefetching. This can be done either by increasing the cache size itself or by increasing set associativity. A small cache is a significant problem for prefetching because conflict misses increase dramatically: any benefit derived from fewer cache lookup misses is negated by the higher incidence of conflict misses. The following study assembled benchmarks for a SPARC (See [[#Definitions]]) processor that utilized prefetching. The following graph shows the number of cache misses for various cache sizes. With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, whereas the number of misses without prefetching is rather high.

For example, the following study of a bioinformatics application showed the effect of prefetching on cache misses as a function of the size of the L1 cache. As can be seen, cache size has a major impact on the miss rate and, consequently, on the performance of the system.<ref>http://www.nikolaylaptev.com/master/classes/cs254.pdf</ref>

[[File:Cachesize.jpg]]

==='''Improved Prefetch Timing'''===

Improving prefetch timing algorithms can also significantly improve performance and prevent data from being prefetched too late or too early to be useful. One possibility is a technique called Aggressive Prefetching<ref>http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html</ref>. Traditionally, prefetch algorithms have brought in only incremental amounts of data with high levels of confidence. The more aggressive approach, however, searches more deeply in the reference stream to bring in a wider array of data that may be used in the near future. This does, however, require a more resource-rich machine to avoid negatively impacting performance.

Greater amounts of processing power and storage have made this more aggressive approach possible. While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now. Varying performance across a variety of devices also reduces the need to be conservative in prefetching. Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.

==='''Memory Pattern Recognition'''===

Effective memory pattern recognition algorithms can go a long way towards resolving issues with prefetching values unnecessarily (i.e., values that might not be used within a useful timeframe). One example is the stride technique. Memory addresses that are sequential in nature, like those of an array, would all be brought in together as a unit, since most or all of these elements are likely to be used. This is an improvement over a more conservative prefetcher that may just assume memory located together will be used. The more conservative version has been known to have a much higher incidence of bringing in unnecessary blocks of data.

Another variation of this is the linked memory reference pattern. Linked lists (See [[#Definitions]]) and tree structures are most often associated with this type. Generally, occurrences of code similar to ptr = ptr->next are treated as one structure, whose elements are subsequently brought into memory together.<ref>http://www.bergs.com/stefan/general/papers/general.pdf</ref>

For example, instead of:

 while (p) {
     work(p)
     p = next(p)
 }

the linked memory reference pattern may work in the following way:

 while (p) {
     prefetch(next(p))
     work(p)
     p = next(p)
 }

Here, the node that follows p is requested while work(p) executes, so it is already closer to the processor when the loop advances to it. This is more efficient than the initial loop.
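A rough timing model shows how much the prefetching loop can save. This is only a sketch under simplifying assumptions: every node misses in the cache, the miss latency and the cost of work(p) are fixed, and a prefetch overlaps perfectly with the work of the preceding node. The function name and parameters are hypothetical.

```python
def traversal_cycles(num_nodes, work_cycles, miss_latency, prefetch_next):
    """Estimated cycles to traverse a list whose nodes all miss in the cache."""
    total = 0
    for i in range(num_nodes):
        if i == 0 or not prefetch_next:
            stall = miss_latency                       # demand miss, nothing overlapped
        else:
            # The prefetch was issued one iteration earlier, so the work on the
            # previous node hid `work_cycles` of the latency.
            stall = max(0, miss_latency - work_cycles)
        total += stall + work_cycles
    return total
```

Because the address of a node is only known after its predecessor has been loaded, this scheme can hide at most one iteration of work per node. With 40 cycles of work and a 100-cycle miss, 60 cycles of stall remain per node in this model.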

==='''Markov Prefetcher'''===

This particular prefetch technique<ref>http://www.bergs.com/stefan/general/papers/general.pdf</ref> is built around the idea of remembering past sequences of cache misses and using this memory to bring in subsequent misses in the sequence. Generally, a rather large table containing previous miss addresses is used. This table is maintained in a manner similar to a cache. Sequences stored in this table are then used to predict future sequences that may need to be prefetched.
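The idea can be sketched as a table that maps each miss address to the miss addresses that have followed it, ranked by how often they did. The sketch below is illustrative only: the class name and the <code>width</code> parameter are assumptions, and the table is unbounded, whereas a hardware table would be finite and managed like a cache.

```python
from collections import defaultdict, Counter

class MarkovPrefetcher:
    """Toy Markov predictor over the stream of miss addresses."""

    def __init__(self, width=2):
        self.width = width                       # predictions issued per miss
        self.successors = defaultdict(Counter)   # miss address -> next-miss counts
        self.previous = None

    def on_miss(self, address):
        """Record a miss and return up to `width` addresses to prefetch."""
        if self.previous is not None:
            self.successors[self.previous][address] += 1
        self.previous = address
        ranked = self.successors[address].most_common(self.width)
        return [addr for addr, _ in ranked]
```

After the miss sequence 1, 2, 3 has been seen once, a repeat of the same sequence lets the predictor name each following miss in advance.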

==='''Stealth Prefetching'''===

A more modern and innovative technique, stealth prefetching<ref>http://arnetminer.org/publication/stealth-prefetching-53889.html</ref>, attempts to deal with issues surrounding increased interconnect bandwidth usage and increased memory latency. It also helps avoid spending bandwidth on prematurely accessing shared data that will later result in state downgrades. This technique tracks regions of memory at a coarse granularity to identify which regions are heavily used by multiple processors and which are not. Lines that are not shared by multiple processors are more aggressively brought nearer to the local processor, improving performance by as much as 20%. With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.

=='''Memory Consistency models'''<ref>http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr</ref><ref>http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf</ref>==
A basic requirement for implementing a shared-memory multiprocessor system is to correctly order the reads and writes to any memory location as seen by every processor that accesses it. This requirement is called the memory consistency requirement, and memory consistency models specify how it is met. The models are implemented at the machine code interface as well as at the high-level language interface, and they must be respected when high-level code is translated into the machine code that is directly executed. The consistency models thereby make memory operations predictable.

==='''An overview of memory consistency models'''===
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:

“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”

[[Image:2.1.jpg|thumb|right|300px| Sequential Consistency Example]]

This model prohibits the reordering of load and store instructions and thus restricts compiler optimization techniques. Models that allow operations to be reordered are called relaxed consistency models. Because reads and writes can be reordered, these models deviate from the programmer’s expectation. Relaxed consistency models include processor consistency, weak ordering, and release consistency.

''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all earlier stores complete before a load takes place, i.e., a younger load can complete before an older store that precedes it has completed.
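The difference between sequential consistency and a model that lets a load bypass an earlier store can be seen by enumerating the possible outcomes of a small two-processor program. The sketch below is an illustration, not a model of any particular machine: the two-thread test program, the function name, and the way the relaxation is modelled (each processor may present its load before its store) are all assumptions made for the example.

```python
from itertools import permutations

def outcomes(allow_store_load_reorder):
    """Return the set of (r1, r2) results for the two-thread test below.

    P1: x = 1; r1 = y        P2: y = 1; r2 = x      (x and y start at 0)
    """
    p1 = [("store", "x", None), ("load", "y", "r1")]
    p2 = [("store", "y", None), ("load", "x", "r2")]

    # Orders in which each processor may present its operations to memory.
    def orders(prog):
        result = [prog]
        if allow_store_load_reorder:
            result.append(prog[::-1])   # the load bypasses the earlier store
        return result

    results = set()
    for o1 in orders(p1):
        for o2 in orders(p2):
            tagged = [(0, op) for op in o1] + [(1, op) for op in o2]
            for perm in permutations(range(4)):
                seq = [tagged[i] for i in perm]
                # Keep only interleavings that preserve each processor's order.
                if [op for who, op in seq if who == 0] != o1:
                    continue
                if [op for who, op in seq if who == 1] != o2:
                    continue
                memory, regs = {"x": 0, "y": 0}, {}
                for _, (kind, var, reg) in seq:
                    if kind == "store":
                        memory[var] = 1
                    else:
                        regs[reg] = memory[var]
                results.add((regs["r1"], regs["r2"]))
    return results
```

Under sequential consistency at least one processor must observe the other's store, so the result (0, 0) never appears. Once loads may complete before older stores, (0, 0) becomes a possible outcome.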

''Weak ordering (WO)'' - Weak ordering groups read and write instructions into critical sections. Reordering the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. The critical sections are separated by synchronization events, which can be implemented as locks, barriers, or post-wait pairs. The requirements for the weak ordering model to work are therefore:
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.
* All read and write operations before a synchronization event must be completed before the synchronization event is performed.
* Loads and stores that follow a critical section cannot be performed before the section.

''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, while a thread leaves the critical section after a release operation. Hence an acquire operation is not concerned with older reads and writes, as that requirement is taken care of by the release operation; similarly, a release operation is not concerned with younger reads and writes, as those are handled by the acquire operation. Hence we can rewrite the requirements for weak ordering to follow the release consistency model as follows:
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.
* All reads and writes before a release operation must be completed.
* All acquire operations related to a critical section must be completed before a younger read or write is handled.
* The acquire and release operations should be atomic with respect to each other.
A core difference between weak ordering and release consistency is that release consistency allows critical sections to overlap.
Figure 3.1 shows how a sequence of instructions is executed under a particular consistency model. Compared to sequential consistency, relaxed consistency models provide faster execution. If we can also prefetch while conforming to the consistency model, the result is an even higher speedup.

=='''Prefetching under consistency models'''<ref>http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf</ref>==
Prefetching can also be classified as binding or non-binding. In binding prefetching, the value of the prefetched data is locked the instant it is prefetched and is released only when the process reads it later via a regular read. Other processors cannot modify the data during this period. In non-binding prefetching, other processors can modify the data until the actual regular read is done. An advantage of non-binding prefetching is that it does not affect the correctness of any consistency model.
Let’s see the improvement in execution time using prefetching by considering the sets of instructions given in Examples 4.1 and 4.2. For both sets, the cache hit latency is 1 cycle while the cache miss latency is 100 cycles. An invalidation-based cache coherence scheme is used in both examples.

Example 4.1:
 lock L (miss)
 write A (miss)
 write B (miss)
 unlock L (hit)

Example 4.2:
 lock L (miss)
 read C (miss)
 read D (hit)
 read E[D] (miss)
 unlock L (hit)
Under sequential consistency (SC), reordering of operations is not allowed. Hence for Example 4.1, three cache misses and one cache hit take a total of 301 cycles to execute. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock. Hence under RC the total execution time is 202 cycles, because the second write overlaps with the first and adds only one cycle. Prefetching improves the performance of systems following either SC or RC. With prefetching, we assume that lock acquisition will succeed and therefore prefetch both writes while the lock access is outstanding. When lock acquisition actually succeeds, the data for both writes is already in the cache. Hence both writes hit, and the total number of cycles required to execute the set is reduced to 103 cycles under both consistency models.
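The cycle counts for Example 4.1 can be reproduced with a few lines of arithmetic. The sketch below is a toy model built from the assumptions stated in the text (1-cycle hits, 100-cycle misses, writes pipelined between lock and unlock under RC, and both writes prefetched while the lock miss is outstanding). The function names are hypothetical, and the model is not meant to cover Example 4.2, where a data dependency limits what can be prefetched.

```python
HIT, MISS = 1, 100   # latencies assumed in the article's examples

# Example 4.1 as (operation, misses_in_cache) pairs.
EXAMPLE_4_1 = [("lock L", True), ("write A", True),
               ("write B", True), ("unlock L", False)]

def latency(miss):
    return MISS if miss else HIT

def with_prefetch(ops):
    """Writes are prefetched while the lock miss is outstanding, so they hit."""
    return [(name, False if name.startswith("write") else miss)
            for name, miss in ops]

def sc_cycles(ops):
    """Sequential consistency: each access completes before the next issues."""
    return sum(latency(miss) for _, miss in ops)

def rc_cycles(ops):
    """Relaxed consistency: accesses between lock and unlock are pipelined."""
    acquire, body, release = ops[0], ops[1:-1], ops[-1]
    # A pipelined group costs its longest access plus one issue cycle for
    # each additional access.
    body_cycles = max(latency(m) for _, m in body) + (len(body) - 1)
    return latency(acquire[1]) + body_cycles + latency(release[1])
```

Evaluating <code>sc_cycles</code> and <code>rc_cycles</code> on <code>EXAMPLE_4_1</code> gives 301 and 202 cycles, and applying <code>with_prefetch</code> first gives 103 cycles for both, matching the figures above.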

A case where prefetching is less effective is given in Example 4.2. Here the third read depends on the second: the value of D is needed to compute the address of E[D]. Under SC, the instructions take 302 cycles to perform. Under RC, they take 203 cycles, because the read of E[D] is overlapped with the earlier reads through pipelining within the critical section. With prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. The smaller improvement in execution time is attributed to the data dependency between D and E (under SC) and between lock acquisition and D (under RC). Hence prefetching cannot hide all of the memory latency in such cases.

=='''Disadvantages of Prefetching'''<ref>http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr</ref>==

* Increased complexity and overhead of handling the prefetching algorithms: performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.

* With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory, which must not only deal with regular prefetch requests but also handle prefetches from different sources, greatly increasing the overhead and complexity of the logic.

* If prefetched data is stored in the data cache, then cache conflicts, or cache pollution, can become a significant burden.

=='''Conclusion'''==
The article dealt with the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and of how prefetching can affect the execution time under different consistency models. Overall, prefetching can genuinely help lower execution time when out-of-order execution does not take place. However, it cannot by itself overlap memory accesses; the processor still performs the loads in the order determined by the program. The only difference is that it gets the data from the cache rather than from main memory, because the data has been prefetched.

=='''Quiz'''==
1. In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?

# Yes in both models
# Yes in SC, No in RC
# No in SC, Yes in RC
# No in both SC and RC.

2. ___________ model categorizes the synchronization operation into Acquire and Release.

# Sequential Consistency
# Release Consistency
# Weak Ordering
# Processor Consistency.

3. Which of these is not a type of hardware implementation of a prefetcher?

# Predication
# Stride
# Stream
# Adjacent Cache line prefetcher

=References=
<references/>

File:NewCacheSize.jpg

2013-04-03T18:49:14Z

Scanjee:

CSC/ECE 506 Spring 2013/10a os

2013-04-03T18:46:39Z

Scanjee: /* Prefetching Improvements */

[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]

'''Prefetching and Memory Consistency Models'''

Previous articles

# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]

== '''Cache Misses'''==
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. Conflict misses are misses that would not occur if the cache was fully-associative and had LRU replacement. Compulsory misses are misses required in any cache organization because they are the first references to an instruction or piece of data. Capacity misses occur when the cache size is not sufficient to hold data between references. Coherence misses are misses that occur as a result of invalidation to preserve multiprocessor cache consistency.The number of compulsory and capacity misses can be reduced by employing prefetching techniques such as by increasing the size of the cache lines or by prefetching blocks ahead of time.

== '''Prefetching'''==
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss.

Prefetching can be achieved in two ways:

* Software Prefetching

* Hardware Prefetching

===Software and Hardware Prefetching<ref>http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html</ref>===

With software prefetching the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive.A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch.If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.

Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program and tries to predict what data the program will access next and prefetches that data. There are few different variants of how this can be done.

1. A stream prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.

2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.

3. An [http://www.techarp.com/showfreebog.aspx?lang=0&bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.

====Software vs. Hardware Prefetching<ref>http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture</ref>====
Software prefetching has the following characteristics:

* Can handle irregular access patterns, which do not trigger the hardware prefetcher.

* Can use less bus bandwidth than hardware prefetching.

* Software prefetches must be added to new code, and they do not benefit existing applications.

The characteristics of the hardware prefetching are as follows :

* Works with existing applications

* Requires regular access patterns

* Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.

* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).

===Hardware-based Prefetching Techniques<ref>http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html</ref><ref>http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html</ref>===
Based on the previous memory access, the hardware based techniques can be employed to dynamically predict the memory address to prefetch. These techniques can be classified into two main classes:
* On-chip Schemes: Based on the addresses required by the processor in all data references.
* Off-chip Schemes: Based on the addresses that result in L1 cahce misses

Some of the most commonly used Prefetching techniques are discussed below.
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types:

1).''always prefetch'' - prefetch the next block on each reference

2).''prefetch on miss'' - prefetch the next block only on a miss

3).''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time.

P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm which dynamically adapts the degree of prefetch for the workload.

A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.

Stride-based prefetching has also been studied mainly for processor caches where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be consider as a better choice because most strides lie within the block size and it can also exploit locality.
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.

Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.

=='''Prefetching Improvements'''==

Fortunately a number of techniques have evolved to address many of the concerns about prefetching, thus making it significantly more effective.

==='''Cache Sizing'''===

Increasing the dimensions of the cache greatly improves the performance of prefetching. This can be done either through increasing the cache size itself or increasing set associativity. A small cache size is a significant problem with prefetching because conflict misses increase dramatically. Any benefit you might derive from having fewer cache lookup misses is negated by this higher incidence of conflict miss. The following study assembled some benchmarks for a given SPARC(See [[#Definitions]]) processor that utilized prefetching. The following graph shows various levels of cache misses based on various cache sizes. With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, which the number of misses without prefetching is rather high.

[[Image:newCacheSize.jpg |thumb|right|300px| Cache Size]]

For example, the following study of a BioInformatics application showed the effects of prefetching cache misses based on the size of the L1 cache. As can be seen, cache size has a major impact on the miss rate and, subsequently, performance in the system.<ref>http://www.nikolaylaptev.com/master/classes/cs254.pdf</ref>

[[File:Cachesize.jpg]]

==='''Improved Prefetch Timing'''===

Improving prefetch timing algorithms can also significantly improve performance and prevent data from being prefetched too late or too early to be useful. One possibility is a technique called aggressive prefetching<ref>http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html</ref>. Traditionally, prefetch algorithms have fetched only incremental amounts of data, and only with high levels of confidence. The more aggressive approach searches more deeply in the reference stream to bring in a wider array of data that may be used in the near future. This does, however, require a machine with more resources to avoid negatively impacting performance.

Greater amounts of processing power and storage have made this more aggressive approach possible. While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now. Varying performance across a variety of devices also reduces the need to be conservative in prefetching. Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.

==='''Memory Pattern Recognition'''===

Effective memory pattern recognition algorithms can go a long way towards resolving issues with prefetching values unnecessarily (i.e., ones that might not be used in a sufficient timeframe). One example is the stride technique: memory addresses that are sequential in nature, like those of an array, are all brought in together as a unit, since most or all of these elements are likely to be used. This is an improvement over a more conservative prefetcher that may just assume memory located together will be used. The more conservative version has been known to have a much higher incidence of bringing in unnecessary blocks of data.

Another variation of this is the linked memory reference pattern. Linked lists (see [[#Definitions]]) and tree structures are most often associated with this type. Generally, occurrences of code similar to ptr = ptr->next are considered as one structure and subsequently brought into memory together.<ref>http://www.bergs.com/stefan/general/papers/general.pdf</ref>

For example, instead of:

 while (p) {
     work(p);
     p = next(p);
 }

the linked memory reference pattern may work in the following way:

 while (p) {
     prefetch(next(p));
     work(p);
     p = next(p);
 }

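Here, the node that follows p is requested while work(p) is still running, so it is likely to be closer to the processor by the time the loop advances to it. This can be more efficient than the initial loop, provided work(p) takes long enough to hide part of the fetch latency.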
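The loop above can be sketched in C as follows. This is only an illustration: the node type and function name are hypothetical, and __builtin_prefetch is assumed to be available (it is a GCC/Clang builtin, not part of standard C).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type, for illustration only. */
struct node {
    int value;
    struct node *next;
};

/* Sum a list while requesting the next node ahead of its use.
 * __builtin_prefetch is only a hint and does not fault on an
 * invalid address, so a NULL next pointer is harmless. */
long sum_with_prefetch(const struct node *p)
{
    long total = 0;
    while (p) {
        __builtin_prefetch(p->next);  /* start fetching the next node */
        total += p->value;            /* stands in for work(p) */
        p = p->next;
    }
    return total;
}
```

Because the prefetch is issued only one node ahead, the benefit depends on how much work is done per node; with very little work the fetch has no time to complete.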

==='''Markov Prefetcher'''===

This particular prefetch technique<ref>http://www.bergs.com/stefan/general/papers/general.pdf</ref> is built around the idea of remembering past sequences of cache misses and using this memory to bring in the subsequent misses in the sequence. Generally, a rather large table containing previous miss addresses is used. This table is maintained in a similar manner to a cache. Sequences stored in this table are then used to help predict future sequences that may need to be prefetched.
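A much-simplified sketch of such a table is shown below. It is an assumption-laden illustration rather than the design from the cited paper: each entry remembers only one successor per miss address, the table is direct-mapped, and all names and the table size are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 64             /* assumed size; real tables are far larger */
#define NO_PREDICTION UINT64_MAX

/* One entry: a miss address and the miss that followed it last time. */
struct markov_entry {
    uint64_t miss_addr;
    uint64_t next_miss;
    int valid;
};

struct markov_table {
    struct markov_entry entries[TABLE_SIZE];
    uint64_t last_miss;
    int has_last;
};

static unsigned slot_of(uint64_t addr) { return (unsigned)(addr % TABLE_SIZE); }

/* Record a miss: link the previous miss to this one. The table is
 * direct-mapped, so a colliding entry is overwritten, as in a cache. */
void markov_record_miss(struct markov_table *t, uint64_t addr)
{
    if (t->has_last) {
        struct markov_entry *e = &t->entries[slot_of(t->last_miss)];
        e->miss_addr = t->last_miss;
        e->next_miss = addr;
        e->valid = 1;
    }
    t->last_miss = addr;
    t->has_last = 1;
}

/* Return the miss that followed addr last time, or NO_PREDICTION. */
uint64_t markov_predict(const struct markov_table *t, uint64_t addr)
{
    const struct markov_entry *e = &t->entries[slot_of(addr)];
    return (e->valid && e->miss_addr == addr) ? e->next_miss : NO_PREDICTION;
}
```

On each miss, a prefetcher built this way would call markov_predict for the missing address and issue a prefetch for the returned address, then call markov_record_miss to update the history.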

==='''Stealth Prefetching'''===

A more modern and innovative technique, stealth prefetching<ref>http://arnetminer.org/publication/stealth-prefetching-53889.html</ref>, attempts to deal with issues surrounding the growing demand for interconnect bandwidth and increased memory latency. It also helps to avoid the problem of using bandwidth to prematurely access shared data that will later result in state downgrades. Basically, this technique tracks memory at a coarse, region-level granularity to identify which regions are heavily used by multiple processors and which are not. Lines that are not being shared by multiple processors are more aggressively brought nearer to the local processor, improving performance by as much as 20%. With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.

=='''Memory Consistency models'''<ref>http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr</ref><ref>http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf</ref>==
A basic requirement for the implementation of a shared-memory multiprocessor system is to correctly order the reads and writes to any memory location as seen by any processor attempting to access it. This requirement is called the memory consistency requirement, and memory consistency models adhere to it. The models are implemented at the machine code interface as well as at the high-level language interface, and they are applied when high-level code is translated into the machine-level code that is directly executed. The consistency models therefore make memory operations predictable.

==='''An overview of memory consistency models'''===
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:

“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”

[[Image:2.1.jpg|thumb|right|300px| Sequential Consistency Example]]

This model restricts the reordering of any load or store instructions and thus restricts compiler optimization techniques. A memory consistency model that allows the reordering of operations is called a relaxed consistency model. As the reads and writes can be reordered, such models deviate from the programmer’s expectation. Relaxed consistency models include processor consistency, weak ordering, and release consistency.

''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all earlier stores complete before a load takes place, i.e. a younger load can complete before an older store that precedes it has completed.

''Weak ordering (WO)'' - Weak ordering separates all read and write instructions into critical sections. Reordering the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. These critical sections are separated by synchronization events, which can be implemented in the form of locks, barriers, or post-wait pairs. Therefore the requirements for the weak ordering model to work are:
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.
* All read and write operations before a synchronization event have been completed.
* All loads and stores following a critical section cannot precede the section.

''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, while a thread leaves the critical section after a release operation. Hence an acquire operation is not concerned with older read/write operations, as that requirement is taken care of by the release operation; similarly, a release operation is not concerned with younger reads/writes, as those are handled by the acquire operation. Hence we can rewrite the requirements for weak ordering to follow the release consistency model as follows:
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.
* All reads and writes before a release operation should be completed.
* All acquire operations related to a critical section should be completed before handling a younger read or write.
* The acquire and release operations should be atomic with respect to each other.
A core difference between weak ordering and release consistency is that in release consistency overlapping of critical sections is possible.
Figure 3.1 shows how a sequence of instructions will be executed following a particular consistency model. As compared to sequential consistency, relaxed consistency models provide faster execution. Now if we can prefetch instructions while conforming to the consistency model, then that will result in an even higher speed up.
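
As an illustration of acquire and release operations, the following minimal sketch uses the C11 <stdatomic.h> interface to build a spinlock. The variable and function names are hypothetical, and the example is meant only to show where the two orderings apply, not how any particular system implements release consistency.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared state, for illustration only. */
static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static int shared_counter = 0;

/* Acquire: younger reads and writes may not move above this point. */
static void acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire)) {
        /* spin until the current holder releases the lock */
    }
}

/* Release: older reads and writes must complete before this point. */
static void release(void)
{
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}

/* The increment forms the critical section between acquire and release. */
int increment_shared(void)
{
    acquire();
    int v = ++shared_counter;
    release();
    return v;
}
```

Accesses inside the critical section may still be reordered with respect to each other; only movement across the acquire (upward) and the release (downward) is prevented.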

=='''Prefetching under consistency models'''<ref>http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf</ref>==
Prefetching can also be classified as binding and non-binding. In binding prefetching, the value of the prefetched data is locked the instant it is prefetched and is released only when the process reads it later via a regular read. Other processes cannot modify the data during this period. In non-binding prefetching, other processors can modify the data until the actual regular read is done. An advantage of non-binding prefetching is that it does not affect the correctness of any consistency model.
Let’s see the improvement in execution time using prefetching by considering a set of instructions given in example 4.1 and 4.2. For both sets, the cache hit latency is 1 cycle while the cache miss latency is 100 cycles. An invalidation based cache coherence scheme is used in both the examples.

Example 4.1:
 lock L      (miss)
 write A     (miss)
 write B     (miss)
 unlock L    (hit)

Example 4.2:
 lock L      (miss)
 read C      (miss)
 read D      (hit)
 read E[D]   (miss)
 unlock L    (hit)
In sequential consistency (SC), reordering of operations is not allowed. Hence for example 4.1, 3 cache misses and a cache hit take a total of 301 cycles to execute. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock. Hence under RC, the total time for execution is 202 cycles as the second write is a cache hit. Prefetching improves the performance of systems following either SC or RC. Using prefetching, we assume that lock acquisition is bound to succeed and hence prefetch both the writes. When lock acquisition actually succeeds, we already have data for both the writes prefetched in our cache. Hence all writes incur a cache hit and the total number of cycles required to execute the set will be reduced to 103 cycles for both consistency models.
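
The totals quoted for example 4.1 can be reproduced with the small sketch below. The way each total is split into terms is one plausible reading of the text (an assumption, not taken from the cited paper), and the function names are hypothetical.

```c
#include <assert.h>

enum { HIT = 1, MISS = 100 };   /* latencies given in the text */

/* SC: lock, write A and write B all miss one after another; unlock hits. */
int example41_sc(void)       { return MISS + MISS + MISS + HIT; }

/* RC: the two writes are pipelined, so the second adds only one cycle. */
int example41_rc(void)       { return MISS + MISS + HIT + HIT; }

/* Prefetching: both writes are fetched while the lock miss is pending,
 * so only the lock pays the miss latency. */
int example41_prefetch(void) { return MISS + HIT + HIT + HIT; }
```

These evaluate to 301, 202, and 103 cycles respectively, matching the figures above.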

A case where prefetching helps much less is given in example 4.2. The read instructions indicate a data dependency between the second and third reads: the address of E depends on the value of D. Under SC, the instructions take 302 cycles to perform. Under RC, they take 203 cycles as reading the value of E is a cache hit using pipelining within the critical section. With prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. The smaller improvement is attributed to the data dependency between D and E (under SC) and between lock acquisition and D (under RC). Hence prefetching yields only a limited improvement in execution time in such cases.
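
The two SC figures for example 4.2 can be checked in the same way. The decomposition below is an assumed reading of the text, with hypothetical function names; the RC figures are not modelled because the text does not say how those totals split into terms.

```c
#include <assert.h>

enum { HIT = 1, MISS = 100 };   /* latencies given in the text */

/* SC without prefetching: lock, C and E[D] miss; D and unlock hit. */
int example42_sc(void)          { return MISS + MISS + HIT + MISS + HIT; }

/* SC with prefetching: C is fetched during the lock miss, but E[D]
 * cannot be prefetched until D has been read, so it still misses. */
int example42_sc_prefetch(void) { return MISS + HIT + HIT + MISS + HIT; }
```

These give 302 and 203 cycles, which shows that the dependent read of E[D] keeps one full miss latency on the critical path.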

=='''Disadvantages of Prefetching'''<ref>http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr</ref>==

* Increased complexity and overhead of handling the prefetching algorithms: performance must be improved significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.

* With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory to not only deal with regular prefetch requests but also to handle prefetch from different sources, thus greatly increasing the overhead and complexity of logic.

* If prefetched data is stored in the data cache, then cache conflicts, or cache pollution, can become a significant burden.

=='''Conclusion'''==
The article dealt with the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and of how prefetching can affect execution time under different consistency models. Overall, prefetching can genuinely help in lowering execution time if out-of-order execution does not take place. However, it cannot help in overlapping memory accesses; the processor will get the loads in the order determined by the program. The only difference is that it will get the data from the cache rather than from main memory because it has been prefetched.

=='''Quiz'''==
1. In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?

# Yes in both models
# Yes in SC, No in RC
# No in SC, Yes in RC
# No in both SC and RC.

2. ___________ model categorizes the synchronization operation into Acquire and Release.

# Sequential Consistency
# Release Consistency
# Weak Ordering
# Processor Consistency.

3. Which of these is not a type of hardware implementation of a prefetcher?

# Predication
# Stride
# Stream
# Adjacent Cache line prefetcher

=References=
<references/>

CSC/ECE 506 Spring 2013/10a os

2013-04-03T18:45:36Z

Scanjee: /* Prefetching Improvements */

[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]

'''Prefetching and Memory Consistency Models'''

Previous articles

# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]

== '''Cache Misses'''==
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. Conflict misses are misses that would not occur if the cache was fully-associative and had LRU replacement. Compulsory misses are misses required in any cache organization because they are the first references to an instruction or piece of data. Capacity misses occur when the cache size is not sufficient to hold data between references. Coherence misses are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques such as increasing the size of the cache lines or prefetching blocks ahead of time.

== '''Prefetching'''==
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss.

Prefetching can be achieved in two ways:

* Software Prefetching

* Hardware Prefetching

===Software and Hardware Prefetching<ref>http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html</ref>===

With software prefetching the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.
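
The timing trade-off described above is usually expressed as a prefetch distance. The sketch below illustrates the idea on an array sum; the distance of 16 elements is an arbitrary assumption that would need tuning on a real machine, the function name is hypothetical, and __builtin_prefetch is a GCC/Clang builtin rather than standard C.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed prefetch distance in elements. Too small and the data
 * arrives late; too large and the line may be evicted before use. */
#define PREFETCH_DISTANCE 16

long sum_array(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* a non-blocking hint, not a load: execution continues */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE]);
        }
        total += data[i];
    }
    return total;
}
```

For a simple sequential loop like this, a hardware prefetcher would often detect the pattern on its own; explicit prefetches tend to matter more for irregular accesses.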

Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.

1. A stream prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.

2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.

3. An [http://www.techarp.com/showfreebog.aspx?lang=0&bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.
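
As a rough illustration of the stride variant, the sketch below tracks the accesses of a single load instruction and proposes a prefetch address once the same stride has been seen twice in a row. The structure layout, the confirmation rule, and the names are assumptions made for this example, not a description of any specific processor.

```c
#include <assert.h>
#include <stdint.h>

/* State kept for one load instruction (hypothetical layout). */
struct stride_entry {
    uint64_t last_addr;
    int64_t  stride;
    int      confirmed;   /* same non-zero stride seen twice in a row */
    int      seen;        /* at least one access recorded */
};

/* Observe an access; return the address to prefetch, or 0 if the
 * stride has not been confirmed yet. */
uint64_t stride_observe(struct stride_entry *e, uint64_t addr)
{
    uint64_t prefetch = 0;
    if (e->seen) {
        int64_t s = (int64_t)(addr - e->last_addr);
        e->confirmed = (s != 0 && s == e->stride);
        e->stride = s;
        if (e->confirmed)
            prefetch = addr + (uint64_t)s;
    }
    e->last_addr = addr;
    e->seen = 1;
    return prefetch;
}
```

A stream prefetcher can be seen as the special case where the stride equals one cache line.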

====Software vs. Hardware Prefetching<ref>http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture</ref>====
Software prefetching has the following characteristics:

* Can handle irregular access patterns, which do not trigger the hardware prefetcher.

* Can use less bus bandwidth than hardware prefetching.

* Software prefetches must be added to new code, and they do not benefit existing applications.

The characteristics of the hardware prefetching are as follows :

* Works with existing applications

* Requires regular access patterns

* Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.

* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).

===Hardware-based Prefetching Techniques<ref>http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html</ref><ref>http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html</ref>===
Based on the previous memory access, the hardware based techniques can be employed to dynamically predict the memory address to prefetch. These techniques can be classified into two main classes:
* On-chip Schemes: Based on the addresses required by the processor in all data references.
* Off-chip Schemes: Based on the addresses that result in L1 cache misses

Some of the most commonly used Prefetching techniques are discussed below.
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types:

1).''always prefetch'' - prefetch the next block on each reference

2).''prefetch on miss'' - prefetch the next block only on a miss

3).''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time.
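
The three OBL policies differ only in the condition under which block b+1 is requested after a reference to block b. A minimal sketch of that decision follows; the enum and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum obl_policy { ALWAYS_PREFETCH, PREFETCH_ON_MISS, TAGGED_PREFETCH };

/* Decide whether to prefetch block b+1 after a reference to block b.
 * was_miss:  the reference missed in the cache.
 * first_use: first reference to the block since it was brought in
 *            (a demand miss, or the first hit on a prefetched block). */
bool obl_should_prefetch(enum obl_policy policy, bool was_miss, bool first_use)
{
    switch (policy) {
    case ALWAYS_PREFETCH:  return true;
    case PREFETCH_ON_MISS: return was_miss;
    case TAGGED_PREFETCH:  return first_use;
    }
    return false;
}
```

The difference shows on the first hit to a prefetched block: tagged prefetch keeps the sequential stream going, while prefetch on miss waits for the next miss.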

P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.

A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.

Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.

Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.

=='''Prefetching Improvements'''==

Fortunately, a number of techniques have evolved to address many of the concerns about prefetching, making it significantly more effective.

==='''Cache Sizing'''===

Increasing the dimensions of the cache greatly improves the performance of prefetching. This can be done either by increasing the cache size itself or by increasing set associativity. A small cache is a significant problem for prefetching because conflict misses increase dramatically; any benefit derived from fewer cache lookup misses is negated by the higher incidence of conflict misses. The following study assembled benchmarks for a SPARC (see [[#Definitions]]) processor that utilized prefetching. The following graph shows the number of cache misses for various cache sizes. With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, whereas the number of misses without prefetching is rather high.

For example, the following study of a bioinformatics application showed the effect of prefetching on cache misses as a function of the size of the L1 cache. As can be seen, cache size has a major impact on the miss rate and, subsequently, on performance in the system.<ref>http://www.nikolaylaptev.com/master/classes/cs254.pdf</ref>

[[File:Cachesize.jpg]]
[[Image: |thumb|right|300px| Cache Size]]

==='''Improved Prefetch Timing'''===

Improving prefetch timing algorithms can also significantly improve performance and prevent data from being prefetched too late or too early to be useful. One possibility is a technique called aggressive prefetching<ref>http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html</ref>. Traditionally, prefetch algorithms have fetched only incremental amounts of data, and only with high levels of confidence. The more aggressive approach searches more deeply in the reference stream to bring in a wider array of data that may be used in the near future. This does, however, require a machine with more resources to avoid negatively impacting performance.

Greater amounts of processing power and storage have made this more aggressive approach possible. While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now. Varying performance across a variety of devices also reduces the need to be conservative in prefetching. Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.

==='''Memory Pattern Recognition'''===

Effective memory pattern recognition algorithms can go a long way towards resolving issues with prefetching values unnecessarily (i.e., ones that might not be used in a sufficient timeframe). One example is the stride technique: memory addresses that are sequential in nature, like those of an array, are all brought in together as a unit, since most or all of these elements are likely to be used. This is an improvement over a more conservative prefetcher that may just assume memory located together will be used. The more conservative version has been known to have a much higher incidence of bringing in unnecessary blocks of data.

Another variation of this is the linked memory reference pattern. Linked lists (see [[#Definitions]]) and tree structures are most often associated with this type. Generally, occurrences of code similar to ptr = ptr->next are considered as one structure and subsequently brought into memory together.<ref>http://www.bergs.com/stefan/general/papers/general.pdf</ref>

For example, instead of:

 while (p) {
     work(p);
     p = next(p);
 }

the linked memory reference pattern may work in the following way:

 while (p) {
     prefetch(next(p));
     work(p);
     p = next(p);
 }

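Here, the node that follows p is requested while work(p) is still running, so it is likely to be closer to the processor by the time the loop advances to it. This can be more efficient than the initial loop, provided work(p) takes long enough to hide part of the fetch latency.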

==='''Markov Prefetcher'''===

This particular prefetch technique<ref>http://www.bergs.com/stefan/general/papers/general.pdf</ref> is built around the idea of remembering past sequences of cache misses and using this memory to bring in the subsequent misses in the sequence. Generally, a rather large table containing previous miss addresses is used. This table is maintained in a similar manner to a cache. Sequences stored in this table are then used to help predict future sequences that may need to be prefetched.

==='''Stealth Prefetching'''===

A more modern and innovative technique, stealth prefetching<ref>http://arnetminer.org/publication/stealth-prefetching-53889.html</ref>, attempts to deal with issues surrounding the growing demand for interconnect bandwidth and increased memory latency. It also helps to avoid the problem of using bandwidth to prematurely access shared data that will later result in state downgrades. Basically, this technique tracks memory at a coarse, region-level granularity to identify which regions are heavily used by multiple processors and which are not. Lines that are not being shared by multiple processors are more aggressively brought nearer to the local processor, improving performance by as much as 20%. With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.

=='''Memory Consistency models'''<ref>http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr</ref><ref>http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf</ref>==
A basic requirement for the implementation of a shared-memory multiprocessor system is to correctly order the reads and writes to any memory location as seen by any processor attempting to access it. This requirement is called the memory consistency requirement, and memory consistency models adhere to it. The models are implemented at the machine code interface as well as at the high-level language interface, and they are applied when high-level code is translated into the machine-level code that is directly executed. The consistency models therefore make memory operations predictable.

==='''An overview of memory consistency models'''===
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:

“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”

[[Image:2.1.jpg|thumb|right|300px| Sequential Consistency Example]]

This model restricts the reordering of any load or store instructions and thus restricts compiler optimization techniques. A memory consistency model that allows the reordering of operations is called a relaxed consistency model. As the reads and writes can be reordered, such models deviate from the programmer’s expectation. Relaxed consistency models include processor consistency, weak ordering, and release consistency.

''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all earlier stores complete before a load takes place, i.e. a younger load can complete before an older store that precedes it has completed.

''Weak ordering (WO)'' - Weak ordering separates all read and write instructions into critical sections. Reordering the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. These critical sections are separated by synchronization events, which can be implemented in the form of locks, barriers, or post-wait pairs. Therefore the requirements for the weak ordering model to work are:
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.
* All read and write operations before a synchronization event have been completed.
* All loads and stores following a critical section cannot precede the section.

''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, while a thread leaves the critical section after a release operation. Hence an acquire operation is not concerned with older read/write operations, as that requirement is taken care of by the release operation; similarly, a release operation is not concerned with younger reads/writes, as those are handled by the acquire operation. Hence we can rewrite the requirements for weak ordering to follow the release consistency model as follows:
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.
* All reads and writes before a release operation should be completed.
* All acquire operations related to a critical section should be completed before handling a younger read or write.
* The acquire and release operations should be atomic with respect to each other.
A core difference between weak ordering and release consistency is that in release consistency overlapping of critical sections is possible.
Figure 3.1 shows how a sequence of instructions will be executed following a particular consistency model. As compared to sequential consistency, relaxed consistency models provide faster execution. Now if we can prefetch instructions while conforming to the consistency model, then that will result in an even higher speed up.

=='''Prefetching under consistency models'''<ref>http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf</ref>==
Prefetching can also be classified as binding and non-binding. In binding prefetching, the value of the prefetched data is locked the instant it is prefetched and is released only when the process reads it later via a regular read. Other processes cannot modify the data during this period. In non-binding prefetching, other processors can modify the data until the actual regular read is done. An advantage of non-binding prefetching is that it does not affect the correctness of any consistency model.
Let’s see the improvement in execution time using prefetching by considering a set of instructions given in example 4.1 and 4.2. For both sets, the cache hit latency is 1 cycle while the cache miss latency is 100 cycles. An invalidation based cache coherence scheme is used in both the examples.

Example 4.1:
lock L (miss)
write A (miss)
write B (miss)
unlock L (hit)

Example 4.2:
lock L (miss)
read C (miss)
read D (hit)
read E[D] (miss)
unlock L (hit)
In sequential consistency (SC), reordering of operations is not allowed. Hence for example 4.1, three cache misses and one cache hit take a total of 301 cycles to execute. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock. Hence under RC, the total execution time is 202 cycles, because the miss latency of the second write overlaps with that of the first. Prefetching improves the performance of systems following either SC or RC. With prefetching, we assume that lock acquisition will succeed and prefetch the data for both writes while the lock access is still outstanding. When lock acquisition actually succeeds, the data for both writes is already in the cache. Hence both writes hit in the cache, and the total number of cycles required to execute the sequence is reduced to 103 for both consistency models.
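
The cycle counts for example 4.1 can be reproduced with a small timing model. The sketch below rests on simplifying assumptions that are not stated in the source: accesses inside the critical section issue one per cycle once the lock is acquired, the unlock waits for all of them, and prefetching turns every miss after the lock into a hit. All function names are hypothetical.

```c
#include <stddef.h>

enum { HIT = 1, MISS = 100 };   /* latencies used in examples 4.1 and 4.2 */

/* SC: each access starts only after the previous one has completed. */
int sc_cycles(const int *lat, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += lat[i];
    }
    return total;
}

/* RC: lat[0] is the acquire and lat[n-1] the release (n >= 2). Accesses in
 * between issue one per cycle after the acquire completes and overlap with
 * each other; the release waits until all of them are done. */
int rc_cycles(const int *lat, size_t n) {
    int acquired = lat[0];
    int last_done = acquired;
    for (size_t i = 1; i + 1 < n; i++) {
        int issue = acquired + (int)(i - 1);
        int done = issue + lat[i];
        if (done > last_done) {
            last_done = done;
        }
    }
    return last_done + lat[n - 1];
}

/* Assumed effect of prefetching when all addresses are known in advance:
 * every access after the acquire becomes a hit. */
void apply_prefetch(const int *lat, int *out, size_t n) {
    out[0] = lat[0];
    for (size_t i = 1; i < n; i++) {
        out[i] = HIT;
    }
}
```

For the latencies of example 4.1 (miss, miss, miss, hit), this model gives 301 cycles under SC, 202 under RC, and 103 under both once <code>apply_prefetch</code> has been applied. It deliberately does not model the address dependency of example 4.2.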

A case where prefetching provides much less benefit is given in example 4.2. The read instructions have a data dependency: the address of the third read, E[D], depends on the value returned by the second read, D. Under SC, the instructions take 302 cycles to perform. Under RC, they take 203 cycles, because the miss on E[D] is pipelined with the miss on C within the critical section. With prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. The smaller improvement is attributed to the dependency between D and E (under SC) and between lock acquisition and D (under RC). Hence prefetching brings only a limited improvement under SC and almost none under RC in such cases.

=='''Disadvantages of Prefetching'''<ref>http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr</ref>==

* Increased complexity and overhead of handling the prefetching algorithms: performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.

* With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory to not only deal with regular prefetch requests but also to handle prefetch from different sources, thus greatly increasing the overhead and complexity of logic.

* If prefetched data is stored in the data cache, then cache conflicts, or cache pollution, can become a significant burden.

=='''Conclusion'''==
The article dealt with the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and of how prefetching can affect execution time under different consistency models. Overall, prefetching can genuinely help in lowering execution time if out-of-order execution does not take place. However, it cannot help in overlapping memory accesses; the processor still performs the loads in the order determined by the program. The only difference is that it gets the data from the cache rather than from main memory because the data has been prefetched.

=='''Quiz'''==
1. In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?

# Yes in both models
# Yes in SC, No in RC
# No in SC, Yes in RC
# No in both SC and RC.

2. ___________ model categorizes the synchronization operation into Acquire and Release.

# Sequential Consistency
# Release Consistency
# Weak Ordering
# Processor Consistency.

3. Which of these is not a type of hardware implementation of a prefetcher?

# Predication
# Stride
# Stream
# Adjacent Cache line prefetcher

=References=
<references/>

CSC/ECE 506 Spring 2013/10a os

2013-04-03T18:42:37Z

Scanjee: /* Prefetching Improvements */

CSC/ECE 506 Spring 2013/10a os

2013-04-03T18:41:57Z

Scanjee: /* Prefetching Improvements */

[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]

'''Prefetching and Memory Consistency Models'''

Previous articles

# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]

== '''Cache Misses'''==
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. Conflict misses are misses that would not occur if the cache was fully-associative and had LRU replacement. Compulsory misses are misses required in any cache organization because they are the first references to an instruction or piece of data. Capacity misses occur when the cache size is not sufficient to hold data between references. Coherence misses are misses that occur as a result of invalidation to preserve multiprocessor cache consistency.The number of compulsory and capacity misses can be reduced by employing prefetching techniques such as by increasing the size of the cache lines or by prefetching blocks ahead of time.

== '''Prefetching'''==
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss.

Prefetching can be achieved in two ways:

* Software Prefetching

* Hardware Prefetching

===Software and Hardware Prefetching<ref>http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html</ref>===

With software prefetching the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive.A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch.If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.

Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.

1. A stream prefetcher looks for streams where a sequence of consecutive cache lines is accessed by the program. When such a stream is found, the processor starts prefetching the cache lines ahead of the program's accesses.

2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.

3. An [http://www.techarp.com/showfreebog.aspx?lang=0&bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.
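
To make the stride variant concrete, the following minimal sketch tracks a single access stream and predicts the next address once the same non-zero stride has been seen twice in a row. The structure <code>stride_entry</code> and the function <code>stride_observe</code> are hypothetical; real hardware typically keeps one such entry per load instruction, and the confidence rule used here is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* State for one tracked access stream; start with all fields zero. */
struct stride_entry {
    uint64_t last_addr;   /* most recent address observed */
    int64_t  stride;      /* difference between the last two addresses */
    bool     seen;        /* true once at least one address was observed */
};

/* Observe one access. Returns true and stores the predicted next address in
 * *predicted when the current stride equals the previous one and is non-zero. */
bool stride_observe(struct stride_entry *e, uint64_t addr, uint64_t *predicted) {
    bool predict = false;
    if (e->seen) {
        int64_t s = (int64_t)(addr - e->last_addr);
        if (s != 0 && s == e->stride) {
            *predicted = addr + (uint64_t)s;
            predict = true;
        }
        e->stride = s;
    }
    e->last_addr = addr;
    e->seen = true;
    return predict;
}
```

With accesses to addresses 100, 164, and 228, the first two calls only train the entry, and the third predicts address 292.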

====Software vs. Hardware Prefetching<ref>http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture</ref>====
Software prefetching has the following characteristics:

* Can handle irregular access patterns, which do not trigger the hardware prefetcher.

* Can use less bus bandwidth than hardware prefetching.

* Software prefetches must be added to new code, and they do not benefit existing applications.

The characteristics of hardware prefetching are as follows:

* Works with existing applications

* Requires regular access patterns

* Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.

* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).

===Hardware-based Prefetching Techniques<ref>http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html</ref><ref>http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html</ref>===
Hardware-based techniques use previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:
* On-chip schemes: based on the addresses required by the processor in all data references.
* Off-chip schemes: based on the addresses that result in L1 cache misses.

Some of the most commonly used Prefetching techniques are discussed below.
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types:

1).''always prefetch'' - prefetch the next block on each reference

2).''prefetch on miss'' - prefetch the next block only on a miss

3).''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time.
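
The three OBL policies differ only in the condition that triggers the prefetch of the next block. A minimal sketch of that decision follows; the enum and function names are hypothetical, and the caller is assumed to know whether the reference missed and whether it is the first reference to the block.

```c
#include <stdbool.h>

enum obl_policy { ALWAYS_PREFETCH, PREFETCH_ON_MISS, TAGGED_PREFETCH };

/* Decide whether block n + 1 should be prefetched on a reference to block n. */
bool obl_should_prefetch(enum obl_policy policy, bool is_miss, bool first_reference) {
    switch (policy) {
    case ALWAYS_PREFETCH:
        return true;              /* prefetch on every reference */
    case PREFETCH_ON_MISS:
        return is_miss;           /* prefetch only when the reference misses */
    case TAGGED_PREFETCH:
        return first_reference;   /* prefetch only on the first access to a block */
    }
    return false;
}
```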

P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.

A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.

Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.

Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.

=='''Prefetching Improvements'''==

Fortunately a number of techniques have evolved to address many of the concerns about prefetching, thus making it significantly more effective.

==='''Cache Sizing'''===

Increasing the dimensions of the cache greatly improves the performance of prefetching. This can be done either by increasing the cache size itself or by increasing set associativity. A small cache is a significant problem for prefetching because conflict misses increase dramatically. Any benefit derived from having fewer cache lookup misses is negated by this higher incidence of conflict misses. One study assembled benchmarks for a SPARC (see [[#Definitions]]) processor that utilized prefetching and graphed the number of cache misses for various cache sizes. With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, while the number of misses without prefetching is rather high.

For example, the following study of a BioInformatics application showed the effects of prefetching cache misses based on the size of the L1 cache. As can be seen, cache size has a major impact on the miss rate and, subsequently, performance in the system.<ref>http://www.nikolaylaptev.com/master/classes/cs254.pdf</ref>

[[Image:Cachesize.jpg|thumb|right|300px|Cache Size]]

==='''Improved Prefetch Timing'''===

Improving prefetch timing algorithms can also significantly improve performance and prevent data from being prefetched too late or too early to be useful. One possibility in this case is to implement a technique called Aggressive Prefetching<ref>http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html</ref>. Traditionally, prefetch algorithms have only brought in incremental amounts of data with higher levels of confidence. The more aggressive approach, however, searches more deeply in the reference stream to bring in a wider array of data that may be used in the near future. This does, however, require a more resource-intensive machine to avoid negatively impacting performance.

Greater amounts of processing power and storage have made this more aggressive approach possible. While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now. Varying performance from a variety of devices also changes the need to be conservative in prefetching. Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.

==='''Memory Pattern Recognition'''===

Effective memory pattern recognition algorithms can go a long way towards resolving issues with prefetching values unnecessarily (i.e., values that might not be used within a useful timeframe). One example is the stride technique. For example, memory addresses that are sequential in nature, like those of an array, would all be brought in together as a unit, since most or all of these elements are likely to be used. This is an improvement over a more conservative prefetcher that may just assume memory located together will be used. The more conservative version has been known to have a much higher incidence of bringing in unnecessary blocks of data.

Another variation of this is the linked memory reference pattern. Linked lists (see [[#Definitions]]) and tree structures are most often associated with this type. Generally, occurrences of code similar to ptr = ptr->next are considered as one structure and subsequently brought into memory together.<ref>http://www.bergs.com/stefan/general/papers/general.pdf</ref>

For example, instead of:

 while (p) {
     work(p);
     p = next(p);
 }

the linked memory reference pattern may work in the following way:

 while (p) {
     prefetch(next(p));
     work(p);
     p = next(p);
 }

Here, the node that follows p is requested ahead of time, so it is already closer to the processor by the time the loop advances to it. This is more efficient than the initial loop.
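
A compilable variant of the prefetching loop is sketched below. It assumes a GCC- or Clang-compatible compiler, whose <code>__builtin_prefetch</code> intrinsic issues a non-binding prefetch hint; the <code>node</code> structure and <code>sum_list</code> function are hypothetical stand-ins for the list and for <code>work(p)</code>.

```c
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Sum a linked list while hinting the next node into the cache. */
long sum_list(const struct node *p) {
    long total = 0;
    while (p) {
#if defined(__GNUC__)
        /* Non-binding hint; a NULL next pointer is harmless because
         * prefetch hints do not fault. */
        __builtin_prefetch(p->next);
#endif
        total += p->value;   /* stands in for work(p) */
        p = p->next;
    }
    return total;
}
```

Whether the hint pays off depends on how much work is done per node: with very little work, the prefetch is issued too close to its use to hide the miss latency.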

==='''Markov Prefetcher'''===

This particular prefetch technique<ref>http://www.bergs.com/stefan/general/papers/general.pdf</ref> is built around the idea of remembering past sequences of cache misses and using this memory to bring in subsequent misses in the sequence. Generally, a rather large table containing previous miss addresses is used. This table is maintained in a similar manner to a cache. Sequences stored in this table are then used to predict future miss sequences, which are prefetched.
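
The table-based idea can be sketched as follows. This is a deliberately reduced version that remembers only one successor per miss address in a direct-mapped table; the table size, the 64-byte line assumption in the index function, and all names are hypothetical and are not taken from the cited paper.

```c
#include <stdbool.h>
#include <stdint.h>

#define MARKOV_ENTRIES 256   /* hypothetical table size */

struct markov_entry {
    uint64_t miss_addr;   /* address that missed */
    uint64_t next_miss;   /* address that missed right after it */
    bool     valid;
};

/* Start with the whole structure zeroed. */
struct markov_table {
    struct markov_entry entries[MARKOV_ENTRIES];
    uint64_t prev_miss;
    bool     have_prev;
};

static unsigned markov_index(uint64_t addr) {
    return (unsigned)((addr >> 6) % MARKOV_ENTRIES);   /* assumes 64-byte lines */
}

/* Record the transition prev_miss -> addr, then look up what followed addr
 * the last time it missed. Returns true and fills *prefetch_addr on a match. */
bool markov_on_miss(struct markov_table *t, uint64_t addr, uint64_t *prefetch_addr) {
    if (t->have_prev) {
        struct markov_entry *e = &t->entries[markov_index(t->prev_miss)];
        e->miss_addr = t->prev_miss;
        e->next_miss = addr;
        e->valid = true;
    }
    t->prev_miss = addr;
    t->have_prev = true;

    const struct markov_entry *hit = &t->entries[markov_index(addr)];
    if (hit->valid && hit->miss_addr == addr) {
        *prefetch_addr = hit->next_miss;
        return true;
    }
    return false;
}
```

After the miss sequence 0x1000, 0x2000, a second miss on 0x1000 yields 0x2000 as the prefetch candidate.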

==='''Stealth Prefetching'''===

A more modern and innovative technique, stealth prefetching<ref>http://arnetminer.org/publication/stealth-prefetching-53889.html</ref>, attempts to deal with issues surrounding the increase of interconnection bandwidth and increased memory latency. It also helps to avoid the problem of using bandwidth to prematurely access shared data that will later result in state downgrades. Basically, this technique tracks regions of memory at a coarse granularity to identify which regions are heavily used by multiple processors and which are not. Lines that are not being shared by multiple processors are more aggressively brought nearer to the local processor, thus improving performance by as much as 20%. With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.

=='''Memory Consistency models'''<ref>http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr</ref><ref>http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf</ref>==
A basic requirement for the implementation of a shared-memory multiprocessor system is to correctly order the reads and writes to any memory location as seen by any processor accessing it. This requirement is called the memory consistency requirement. The memory consistency models adhere to this requirement. The models are implemented in a system at the machine code interface as well as at the high-level language interface. They are applied when translating high-level code to the machine-level code that is directly executed. Therefore the consistency models make the memory operations predictable.

==='''An overview of memory consistency models'''===
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:

“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”

[[Image:2.1.jpg|thumb|right|300px| Sequential Consistency Example]]

This model restricts the reordering of any load or store instructions and thus restricts compiler optimization techniques. Memory consistency models that allow operations to be reordered are called relaxed consistency models. As the reads and writes can be reordered, they deviate from the programmer’s expectation. Some of the relaxed consistency models include processor consistency, weak ordering, and release consistency.

''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all older stores complete before a load can take place, i.e., a younger load can complete before the store preceding it has completed.

''Weak ordering (WO)'' - Weak ordering separates all read and write instructions into critical sections. Reordering of the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. These critical sections are separated by synchronization events. The synchronization events can be implemented in the form of locks, barriers, or post-wait pairs. Therefore the requirements for the weak ordering model to work are:
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.
* All read and write operations before a synchronization event have been completed.
* All loads and stores following a critical section cannot precede the section.

''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, and a release operation marks the point at which the thread leaves it. Hence an acquire operation is not concerned with older reads and writes, since that requirement is taken care of by the release operation; similarly, a release operation is not concerned with younger reads and writes, since those are handled by the acquire operation. Hence we can rewrite the requirements for weak ordering to follow the release consistency model as follows:
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.
* All reads and writes before a release operation should be completed before the release is performed.
* All acquire operations related to a critical section should be completed before any younger read or write is handled.
* The acquire and release operations should be atomic with respect to each other.
A core difference between weak ordering and release consistency is that in release consistency overlapping of critical sections is possible.
Figure 3.1 shows how a sequence of instructions is executed under a particular consistency model. Compared to sequential consistency, relaxed consistency models allow faster execution. If data can also be prefetched while still conforming to the consistency model, the speedup is even higher.

=='''Prefetching under consistency models'''<ref>http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf</ref>==
Prefetching can also be classified as binding or non-binding. In binding prefetching, the value of the prefetched data is locked the instant it is prefetched and is released only when the process later reads it via a regular read. Other processes cannot modify the data during this period. In non-binding prefetching, other processors can modify the data until the actual regular read is performed. An advantage of non-binding prefetching is that it does not affect the correctness of any consistency model.
Let’s see the improvement in execution time using prefetching by considering a set of instructions given in example 4.1 and 4.2. For both sets, the cache hit latency is 1 cycle while the cache miss latency is 100 cycles. An invalidation based cache coherence scheme is used in both the examples.

Example 4.1:
lock L (miss)
write A (miss)
write B (miss)
unlock L (hit)

Example 4.2:
lock L (miss)
read C (miss)
read D (hit)
read E[D] (miss)
unlock L (hit)
In sequential consistency (SC), reordering of operations is not allowed. Hence for example 4.1, three cache misses and one cache hit take a total of 301 cycles to execute. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock. Hence under RC, the total execution time is 202 cycles, because the miss latency of the second write overlaps with that of the first. Prefetching improves the performance of systems following either SC or RC. With prefetching, we assume that lock acquisition will succeed and prefetch the data for both writes while the lock access is still outstanding. When lock acquisition actually succeeds, the data for both writes is already in the cache. Hence both writes hit in the cache, and the total number of cycles required to execute the sequence is reduced to 103 for both consistency models.

A case where prefetching provides much less benefit is given in example 4.2. The read instructions have a data dependency: the address of the third read, E[D], depends on the value returned by the second read, D. Under SC, the instructions take 302 cycles to perform. Under RC, they take 203 cycles, because the miss on E[D] is pipelined with the miss on C within the critical section. With prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. The smaller improvement is attributed to the dependency between D and E (under SC) and between lock acquisition and D (under RC). Hence prefetching brings only a limited improvement under SC and almost none under RC in such cases.

=='''Disadvantages of Prefetching'''<ref>http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr</ref>==

* Increased complexity and overhead of handling the prefetching algorithms: performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.

* With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory to not only deal with regular prefetch requests but also to handle prefetch from different sources, thus greatly increasing the overhead and complexity of logic.

* If prefetched data is stored in the data cache, then cache conflicts, or cache pollution, can become a significant burden.

=='''Conclusion'''==
The article dealt with the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and of how prefetching can affect execution time under different consistency models. Overall, prefetching can genuinely help in lowering execution time if out-of-order execution does not take place. However, it cannot help in overlapping memory accesses; the processor still performs the loads in the order determined by the program. The only difference is that it gets the data from the cache rather than from main memory because the data has been prefetched.

=='''Quiz'''==
1. In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?

# Yes in both models
# Yes in SC, No in RC
# No in SC, Yes in RC
# No in both SC and RC.

2. ___________ model categorizes the synchronization operation into Acquire and Release.

# Sequential Consistency
# Release Consistency
# Weak Ordering
# Processor Consistency.

3. Which of these is not a type of hardware implementation of a prefetcher?

# Predication
# Stride
# Stream
# Adjacent Cache line prefetcher

=References=
<references/>

File:Prefetching.png

2013-04-02T00:18:56Z

Scanjee: uploaded a new version of "File:Prefetching.png": Reverted to version as of 00:17, 2 April 2013

Illustrates how Prefetching can reduce memory latency

File:Prefetching.png

2013-04-02T00:18:40Z

Scanjee: uploaded a new version of "File:Prefetching.png": Illustrates how Prefetching can reduce memory latency

Illustrates how Prefetching can reduce memory latency

File:Prefetching.png

2013-04-02T00:17:33Z

Scanjee: Illustrates how Prefetching can reduce memory latency

Illustrates how Prefetching can reduce memory latency

CSC/ECE 506 Spring 2013/10a os

2013-04-01T23:46:54Z

Scanjee:

'''Prefetching and Memory Consistency Models'''

== '''Overview''' ==
This wiki article explores two topics: sequential prefetching and memory consistency models. The article describes prefetching and explains different types of prefetching, such as fixed and adaptive, in detail. This is followed by different types of memory consistency models, such as sequential consistency and relaxed consistency. It also discusses the authors' and researchers' comments through examples.

= '''Prefetching''' =

Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution<ref>http://129.16.20.23/~pers/pub/j5.pdf</ref>.

Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. Prefetching approaches proposed in the literature are software or hardware based.

Software-controlled prefetching schemes rely on the programmer/compiler to insert prefetch instructions prior to the instructions that trigger a miss. In addition, both the processor and the memory system must be able to support prefetch instructions, which can potentially increase the code size and the run-time overhead. By contrast, hardware-controlled prefetching relieves the programmer/compiler from the burden of deciding what and when to prefetch. Usually, these schemes take advantage of the regularity of data access in scientific computations by dynamically detecting access strides.

Basically there are two types of prefetching techniques:

1. Fixed sequential prefetching

2. Adaptive sequential prefetching

More details about these are discussed in the following sections, along with other prefetching models:

== Simulated processing Node Architecture ==
According to Fig. 2.1, the processing node consists of a processor, a first-level cache (FLC), a second-level cache (SLC), a first- and second-level write buffer (FLWB and SLWB), a local bus, a network interface controller, and a memory module. The FLC is a direct-mapped write-through cache with no allocation of blocks on write misses and is blocking on read misses. Writes and read miss requests are buffered in the FLWB. The second-level cache (SLC) is a direct-mapped write-back cache. For both prefetching techniques, we only prefetch into the SLC. In addition, a second-level write buffer (SLWB) keeps track of outstanding requests (SLC read miss, prefetch, and write requests). No more than one request to the same block is allowed to be issued to the system; others are just kept in the SLWB while waiting for the pending request to that block to complete. Moreover, a read miss request may bypass write requests if they are for different blocks.
[[File:Procenv.png|thumb|center|upright|350px|Figure 2.1. Processor environment and simulated architecture]]

==Fixed Sequential Prefetching<ref>http://www.springerlink.com/content/lu0755310187318n/</ref>==
By fixed sequential prefetching we mean that K consecutive blocks are prefetched into the SLC on a reference to a block, i.e., blocks n + 1 ... n + K are prefetched upon a reference to block n, if they are not present in the cache. Sequential prefetching has been extensively studied in the context of uniprocessors, but to our knowledge, has never been considered for general applications on multiprocessors. Although many sequential strategies have been proposed for uniprocessors, we have restricted ourselves to prefetching on a miss in the SLC. When a reference misses in the SLC, the miss request is sent to memory, and the cache is searched for the K consecutive blocks directly following the missing block in the address space. The blocks among the K consecutive blocks that are not present in the SLC and have no pending requests in the SLWB are prefetched. We refer to K as the degree of prefetching.
[[File:untitled1.png|thumb|center|upright|350px|Figure 2.2. The fixed sequential prefetching mechanism]]

Fig. 2.2 shows the mechanism of the fixed sequential prefetching scheme. As a cache lookup is made for block address n, the next block address (n + 1) is calculated. On a read miss, a read request is issued to the memory system and is kept in the SLWB. In the next cache cycle, the calculated address (n + 1) is directed to the cache, and a cache lookup is made. If the block is not present in the cache, a prefetch request is issued and is kept in the SLWB. During that time, the subsequent block address (n + 2) is calculated. The number of iterations is determined by the degree of prefetching. The processor is blocked only during the time it takes to handle the first read miss. Since the prefetch requests are issued one at a time and are pipelined in the memory system, they can be overlapped with the original read request. Besides the simple extensions in the SLC to incorporate fixed sequential prefetching, the memory system must be able to handle three new network commands: a prefetch request and two reply messages denoted PreData and PreNeg. Whereas PreData carries the prefetched block, PreNeg tells the cache that the prefetch request cannot be satisfied because the memory copy is in a transient state, i.e., some other cache is reading or writing to it.

==Adaptive Sequential Prefetching<ref>http://web.cecs.pdx.edu/~walpole/papers/mmcn1998b.pdf</ref>==
The mechanism behind the adaptive scheme is basically the same as that of fixed sequential prefetching. For example, prefetching is activated by a read miss and blocks are prefetched into the SLC. In contrast to fixed sequential prefetching, however, the degree of prefetching is not fixed; rather it is controlled by a register, the LookaheadCounter. The adaptive sequential prefetching scheme relies on adjusting the degree of prefetching (the value of the LookaheadCounter) dynamically by counting the useful prefetches, i.e., prefetched blocks that are actually referenced during their lifetime in the cache. To explain how this is achieved, we will first focus on how the algorithm measures the prefetch efficiency and then on how the LookaheadCounter is adjusted to a certain prefetch efficiency. The mechanisms needed to achieve these tasks (two bits per cache line and three counters per cache) appear in Table 1.

{|class="wikitable"
|-
|PrefetchBit (per cache line)
|Used to detect useful prefetches (needed when prefetching is turned on.)
|-
|ZeroBit (per cache line)
|Used to detect when a prefetch would have been useful (needed when prefetching is turned off.)
|-
|LookaheadCounter (per cache)
|The current degree of prefetching (per cache)
|-
|PrefetchCounter (per cache)
|Counts the number of prefetches that have been returned after each read miss
|-
|UsefulCounter (per cache)
|Counts the number of useful prefetches
|}
Conceptually, the algorithm measures the prefetch efficiency by counting the fraction of prefetched blocks that are referenced by the processors. If this fraction exceeds a preset threshold, the degree of prefetching is increased and, if it is below another preset threshold, the degree of prefetching is decreased.

The basic mechanisms used to measure the prefetch efficiency consist of two counters (the PrefetchCounter and the UsefulCounter) and a PrefetchBit per cache line, all of which are cleared at the very beginning. The fraction of useful prefetches is established as the ratio of the UsefulCounter to the PrefetchCounter as follows. The number of prefetched blocks is counted by incrementing the PrefetchCounter whenever a prefetch acknowledgment is received from the memory system, regardless of whether the prefetch was accepted (PreData) or rejected (PreNeg), e.g., because the memory block was in a transient state and neither clean nor dirty. To count the number of prefetched blocks that are referenced, the PrefetchBit of a prefetched block is set; when a block is accessed with its PrefetchBit set, the UsefulCounter is incremented and the PrefetchBit is cleared.

Every time the PrefetchCounter reaches its maximum (i.e., it wraps around), the value of the UsefulCounter is matched against two preset thresholds to determine if the LookaheadCounter, initially set to one, should be changed. If the UsefulCounter exceeds the upper threshold, we are in a phase of execution where the program could benefit from a higher degree of prefetching and therefore the LookaheadCounter is incremented. If the UsefulCounter is lower than the lower preset threshold, the amount of prefetching is too high and the LookaheadCounter is decremented. Finally, if the UsefulCounter has a value between the two thresholds, the LookaheadCounter is not affected. In all cases, the UsefulCounter is cleared. In our evaluation, we have considered counters modulo 16 (4 bits).

When the LookaheadCounter reaches zero, prefetching is turned off. To turn it back on, we use the following mechanism. When a block is received on a read miss and prefetching is turned off, the ZeroBit in the corresponding SLC block frame, which is initially cleared, is set to indicate that the following block in the address space could have been prefetched, and the PrefetchCounter is incremented. On a read miss, a cache lookup is made to the previous block (by address); if it hits and the ZeroBit is set, the UsefulCounter is incremented and the ZeroBit is cleared. The ZeroBit of a block is also cleared when the block is accessed and the LookaheadCounter is not zero, to keep the number of ZeroBits that have been previously set to a minimum.
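The counter update rule can be summarised with a minimal Python sketch. The modulo-16 PrefetchCounter and the initial LookaheadCounter value of one follow the text; the threshold values (12 and 4), the class name <code>AdaptiveDegreeController</code> and the helper <code>run_window</code> are assumptions made only for illustration. The PrefetchBit and ZeroBit bookkeeping is not modelled: the sketch simply assumes that something calls <code>on_useful_reference</code> whenever one of those bits signals a useful (or would-have-been-useful) prefetch.

```python
class AdaptiveDegreeController:
    """LookaheadCounter update rule; threshold values are assumed, not from the source."""

    def __init__(self, upper=12, lower=4, modulo=16):
        self.lookahead_counter = 1   # degree of prefetching, initially one
        self.prefetch_counter = 0    # wraps around at `modulo`
        self.useful_counter = 0
        self.upper = upper
        self.lower = lower
        self.modulo = modulo

    def on_useful_reference(self):
        """A block with its PrefetchBit (or ZeroBit) set was referenced."""
        self.useful_counter += 1

    def on_prefetch_reply(self):
        """A PreData or PreNeg reply arrived (or a ZeroBit was set)."""
        self.prefetch_counter = (self.prefetch_counter + 1) % self.modulo
        if self.prefetch_counter == 0:  # the counter wrapped around
            if self.useful_counter > self.upper:
                self.lookahead_counter += 1
            elif self.useful_counter < self.lower and self.lookahead_counter > 0:
                self.lookahead_counter -= 1
            self.useful_counter = 0     # cleared in all cases


def run_window(ctrl, useful, replies=16):
    """Feed one PrefetchCounter window containing `useful` useful prefetches."""
    for i in range(replies):
        if i < useful:
            ctrl.on_useful_reference()
        ctrl.on_prefetch_reply()
    return ctrl.lookahead_counter


ctrl = AdaptiveDegreeController()
print(run_window(ctrl, 14))  # many useful prefetches: degree grows
print(run_window(ctrl, 2))   # few useful prefetches: degree shrinks
```

Because the same two counters are used while prefetching is turned off, the same comparison against the upper threshold is what raises the LookaheadCounter from zero again.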

==Chip Multiprocessing Prefetching (CMP)==
[[File:PrefetchImp.png|thumb|right|upright|300px|Figure 2.3. Prefetch Implementation]]
Prefetching the lowest miss address stream in the cache hierarchy has many advantages, particularly in a CMP system. First, in a CMP, the L2 cache is often shared by all processors on the chip. Consequently, prefetching the L2 miss address stream can share prefetch history among the processors, resulting in larger history tables. Second, prefetching L2 miss addresses reduces contention on the cache ports, which is becoming increasingly important as the number of processors per chip grows. Before a prefetch is sent to the memory subsystem, it must access the L2 directory. Since the L2 miss address stream has the fewest memory references, it will generate fewer prefetches and access the cache ports less often. Last, prefetching into the L1 is relatively insignificant, since modern out-of-order processors can tolerate most L1 data cache misses with relatively little performance degradation. Prefetching in a CMP is more difficult than in a uniprocessor system. In addition to limited bandwidth and increased latency (as described earlier), cache coherency protocols play an important role in CMP prefetching.

==Disadvantages of Prefetching==

1. Increased complexity and overhead of handling the prefetching algorithms. Performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.

2. With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory, which must not only deal with regular prefetch requests but also handle prefetches from different sources, thus greatly increasing the overhead and complexity of the logic.

3. If prefetched data is stored in the data cache, then cache conflicts or cache pollution can become a significant burden.

=References=
<references/>

2013-02-24T07:50:24Z

Scanjee: /* Write-Back/Ownership Schemes */

=Cache Hierarchy=
[[Image: memchart.jpg|thumbnail|300px|right|Memory Hierarchy[[#9foot|[9]]]]]
In a simple computer model, the processor reads data and instructions from memory and operates on the data. The operating frequency of CPUs has increased faster than the speed of memory and memory interconnects. For example, cores in Intel first generation i7 processors run at 3.2 GHz, while the memory only runs at 1.3 GHz. Multi-core architectures have also put more demand on memory bandwidth. This increases the latency of memory accesses and leaves the CPU idle for much of the time. As a result, memory became a performance bottleneck.

To solve this problem, the “cache” was invented. A cache is simply a temporary volatile storage space like primary memory, but it runs at a speed similar to the core frequency. The CPU can access data and instructions from cache in a few clock cycles, while accessing data from main memory can take more than 50 cycles. In the early days of computing, cache was implemented as a standalone chip outside the processor. In today’s processors, cache is implemented on the same die as the core.

There can be multiple levels of cache, each level farther from the core and larger in size. L1 is closest to the CPU and, as a result, fastest to access. Next to L1 is the L2 cache and then L3. The L1 cache is divided into an instruction cache and a data cache. This is better than having a combined larger cache because the instruction cache, being read-only, is easy to implement, while the data cache is read-write.

Cache management is impacted by three characteristics of modern processor architectures: multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.

Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds, which are advancing more rapidly. Two-level and three-level cache hierarchies are common. L1 typically ranges from 16-64KB and provides access in 2-4 cycles. L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles. L3 typically ranges from 4MB to 32MB and provides access to the data it contains within 30-50 cycles.[[#10foot|[10]]]

Multi-core and multiprocessor designs place additional constraints on cache management by requiring it to consider all processes running concurrently in the cache coherency, replenishment, fetching, and other management functions.

In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors and others private to each processor. Typically, lower-level caches are private and higher-level caches may or may not be shared. Notice that none of the examples below has a shared L1 cache. Cache management functions must consider both shared and private caches when reading and writing data from memory.
[[Image:Uniprocessor_memory_hierarchy.jpg|thumbnail|200px|Uniprocessor memory hierarchy[[#2foot|[2]]]]]
[[Image:Dualcore_memory_hierarchy.jpg|thumbnail|200px|Dualcore memory hierarchy[[#3foot|[3]]]]]

{| border='1' class="wikitable" style="text-align:center"
|+style="white-space:nowrap"|Table 1: Cache on different Microprocessors
|-
! Company & Processor !! # cores !! L1 cache per core !! L2 cache !! L3 cache !! Year
|-
| Intel Pentium Dual Core || 2 || I:32KB D:32KB || 1MB 8 way set assoc. || - || 2006
|-
| Intel Xeon Clovertown || 2 || I:4*32KB D:4*32KB || 2*4MB || - || Jan 2007
|-
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007
|-
| AMD Athlon 64FX || 2 || I:64KB D:4KB Both 2-way set assoc. || 1MB 16way set assoc. || - || Jan 2007
|-
| AMD Athlon 64X2 || 2 || I:64KB D:4KB Both 2-way set assoc. || 512KB/1MB 16way set assoc. || 2MB || Jan 2007
|-
| AMD Barcelona || 4 || I:64KB D:64KB || 512KB || 2MB Shared || Aug 2007
|-
| Sun Microsystems Ultra Sparc T2 || 8 || I:16KB D:8KB || 4MB (8 banks) 16way set assoc. || - || Oct 2007
|-
| Intel Xeon Wolfdale DP || 2 || D:96KB || 6MB || - || Nov 2007
|-
| Intel Xeon Harpertown || 4 || D:96KB || 2*6MB || - || Nov 2007
|-
| AMD Phenom || 4 || I:64KB D:64KB || 512KB || 2MB Shared || Nov 2007 Mar 2008
|-
| Intel Core 2 Duo || 2 || I:32KB D:32KB || 2/4MB 8 way set assoc. || - || 2008
|-
| Intel Penryn Wolfdale DP || 4 || - || 6-12MB || 6MB || Mar 2008 Aug 2008
|-
| Intel Core 2 Quad Yorkfield || 4 || D:96KB || 12MB || - || Mar 2008
|-
| AMD Toliman || 3K10 || I:64KB D:64KB || 512KB || 2MB Shared || Apr 2008
|-
| Azul Systems Vega3 7300 Series || 864 || 768GB || - || - || May 2008
|-
| IBM RoadRunner || 8+1 || 32KB || 512KB || - || Jun 2008
|}

=Cache Write Policies=

In section 6.2.3[[#10foot|[10]]], cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory. As a review, write-through writes data to both the cache and memory on every write. Write-back writes to the cache first and to memory only when a flush is required.
The write miss policies covered in the text[[#10foot|[10]]], write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line.
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.

==Write hit policies==
[[Image:Cache_policy_comparison.jpg|thumbnail|600px|Cache policy comparison[[#5foot|[5]]]]]
*Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.
**Advantages:
***Easy to implement
***Main memory has most recent copy of the data
***Read misses never result in writes to main memory
**Disadvantages:
***Every write needs to access main memory
***Bandwidth intensive
***Writes are slower
*Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.
**Advantages:
***Writes are as fast as the speed of the cache memory
***Multiple writes to a block require one write to main memory
***Less bandwidth intensive
**Disadvantages:
***Harder to implement
***Main memory may not be consistent with cache
***Reads that result in data replacement may cause dirty blocks to be written to main memory
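The trade-off between the two write hit policies can be made concrete by counting main-memory writes for a short access trace. The following Python sketch is a toy model, not a description of any real cache: it assumes a tiny fully associative LRU cache that allocates a line on every miss, and the function name <code>memory_writes</code> and the trace format are hypothetical.

```python
def memory_writes(ops, policy, capacity=2):
    """Count main-memory writes for a trace of ("r" | "w", block) operations.

    Assumptions: fully associative cache, LRU replacement, allocate on
    every miss. `policy` is "write-through" or "write-back".
    """
    cache = {}   # block -> dirty flag; insertion order tracks LRU (oldest first)
    writes = 0
    for op, block in ops:
        dirty = cache.pop(block, None)   # re-inserted below as most recent
        if dirty is None:                # miss: may need to evict a victim
            dirty = False
            if len(cache) >= capacity:
                victim = next(iter(cache))
                if cache.pop(victim):
                    writes += 1          # dirty victim is written back
        if op == "w":
            if policy == "write-through":
                writes += 1              # every write goes to memory
            else:
                dirty = True             # write-back: only mark the line dirty
        cache[block] = dirty
    return writes


trace = [("w", 0)] * 4 + [("r", 1), ("r", 2)]
print(memory_writes(trace, "write-through"))
print(memory_writes(trace, "write-back"))
```

For this trace, write-through sends each of the four writes to memory, while write-back performs a single memory write, and that write is triggered by a read miss that evicts the dirty block.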

==Write miss policies==
Write miss policies can help eliminate write misses and therefore reduce bus traffic. This reduces the wait-time of all processors sharing the bus.

*Write-allocate vs no-write-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block. For write-allocate, there will be a line in the cache for the written data. This policy is typically associated with write-back caches. For no-write-allocate, there will not be a line in the cache.
*Fetch-on-write vs no-fetch-on-write: Fetch-on-write causes the block of data to be fetched from a lower level of the memory hierarchy on every write miss. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line. Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.
*Write-before-hit vs no-write-before-hit: The write-before-hit will write data to the cache before checking the cache tags for a match. In case of a miss, the policy will displace the block of data already in the cache.

==Combination Policies==
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram. They are considered unproductive since the data is fetched from lower level memory, but is not stored in the cache. Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low.

* Write-validate: It is a combination of no-fetch-on-write and write-allocate[[#4foot|[4]]]. The policy allows partial lines to be written to the cache on a miss. It provides better performance, works with machines that have various line sizes, and does not add instruction execution overhead to the program being run. Write-validate requires that the lower level system memory can support writes of partial lines. The no-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the process is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate[[#4foot|[4]]]. This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.
* Write-around: Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit[[#4foot|[4]]]. This policy uses a non-blocking write scheme to write to cache. It writes data to the next lower cache without modifying the data of the cache line. It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that in all but a few cases write-around performs worse than write-validate. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from lower-level memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.

Of these three combinations, write-invalidate performed the worst, with the smallest reduction in write misses. The author notes, though, that it does perform better than fetch-on-write and is easy to implement.

* Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected. This policy was used as the base-line for determining the performance of the other three policy combinations. In general, all three previous combinations exhibited fewer misses than this policy.
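The way the three write miss choices combine into the named policies of this section can be written down as a small lookup. The Python function below is only a restatement of the combinations listed above; the function name <code>name_write_miss_policy</code> and its boolean parameters are hypothetical.

```python
def name_write_miss_policy(fetch_on_write, write_allocate, write_before_hit):
    """Map the three write miss choices to the combination names used above."""
    if fetch_on_write:
        if write_allocate:
            return "fetch-on-write"
        # Fetching a block that is then not stored in the cache is wasted work.
        return "unproductive"
    if write_allocate:
        return "write-validate"      # no-fetch-on-write + write-allocate
    if write_before_hit:
        return "write-invalidate"    # no-fetch, no-allocate, write-before-hit
    return "write-around"            # no-fetch, no-allocate, no-write-before-hit


for flags in [(True, True, False), (False, True, False),
              (False, False, True), (False, False, False)]:
    print(flags, name_write_miss_policy(*flags))
```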

=Prefetching=
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. A cache is very efficient in terms of access time once the data or instructions are in it. But when a process tries to access something that is not already in the cache, a cache miss occurs and the missing data needs to be brought into the cache from memory. Cache misses are generally expensive because the processor has to wait for the data (in parallel processing, a process can execute other tasks while it is waiting on data, but this carries some overhead). Prefetching brings data into the cache before the program needs it; in other words, it is a way to reduce cache misses. Prefetching uses some type of prediction mechanism to anticipate the next data that will be needed and brings it into the cache. It is not guaranteed that the prefetched data will be used. The goal is to reduce cache misses and thereby improve overall performance.
Some architectures have instructions to prefetch data into cache. Programmers and compilers can insert these prefetch instructions into the code. This is known as software prefetching. In hardware prefetching, the processor observes the system behavior and issues prefetch requests itself. Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch. Graphics Processing Units benefit from prefetching due to the spatial locality of their data[[#8foot|[8]]].

==Advantages==
*Improves overall performance by reducing cache misses.
==Disadvantages==
* Wastes bandwidth when prefetched data is not used.
* Hardware prefetching requires a more complex architecture. A second-order effect is the cost of implementing it on silicon and validating it.
* Software prefetching adds additional instructions to the program, making the program larger.
* If the same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted. If the evicted blocks are needed, that will generate a cache miss. This can be prevented by having a separate cache for prefetching, but it comes with hardware costs.
* When the scheduler changes the task running on a processor, prefetched data may become useless.

==Effectiveness==
Prefetching effectiveness can be tracked by the following metrics:
# Coverage is defined as the fraction of original cache misses that were prefetched, resulting in a cache hit.
# Accuracy is defined as the fraction of prefetches that are useful.
# Timeliness measures how early the prefetches arrive.
Ideally, a system should have high coverage, high accuracy and optimum timeliness. Realistically, aggressive prefetching can increase coverage but decrease accuracy, and vice versa. Also, if prefetching is done too early, the fetched data may have to be evicted without being used, and if it is done too late, it can still cause a cache miss.
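As a small worked example of the first two metrics, the Python sketch below computes coverage and accuracy from three event counts. The function name <code>prefetch_metrics</code> and the numbers in the example are made up for illustration; timeliness is not modelled because it depends on when each prefetch arrives rather than on simple counts.

```python
def prefetch_metrics(original_misses, prefetches_issued, useful_prefetches):
    """Return (coverage, accuracy) from simple event counts.

    original_misses:    misses the program would incur without prefetching
    prefetches_issued:  total prefetch requests sent
    useful_prefetches:  prefetched blocks that turned a miss into a hit
    """
    coverage = useful_prefetches / original_misses
    accuracy = useful_prefetches / prefetches_issued
    return coverage, accuracy


# Hypothetical counts: 1000 original misses, 500 prefetches, 400 of them useful.
coverage, accuracy = prefetch_metrics(1000, 500, 400)
print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Issuing more prefetches with the same predictor typically raises the number of useful prefetches (higher coverage) but raises the number of issued prefetches faster (lower accuracy), which is the tension described above.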

==Stream Buffer Prefetching==
This is a prefetching technique that uses a FIFO buffer in which each entry is a cache line with an address (or tag) and an available bit. The system prefetches a stream of sequential data into a stream buffer, and multiple stream buffers can be used to prefetch multiple streams in parallel. On a cache access, the head entries of the stream buffers are checked for a match along with the cache. If the block is not found in the cache but is found at the head of a stream buffer, it is moved to the cache and the next entry in the buffer becomes the head. If the block is found neither in the cache nor at the head of a buffer, the data is brought from memory into the cache and the subsequent address is assigned to a new stream buffer. Only the heads of the stream buffers are checked during a cache access, not the whole buffer. Checking all the entries in all the buffers would increase hardware complexity.
 
 
[[Image:Cache_hit_improvements.jpg|center]]
 
The plot above shows the cache hit improvements with respect to the number of stream buffers on different programs. The graph on the left compares 8 programs from the NAS suite, while the graph on the right shows programs from the Unix unity suite[[#7foot|[7]]].
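The head-only lookup of a stream buffer can be sketched as follows. This Python model makes several simplifying assumptions that are not in the text: there is a single stream buffer, the cache is an unbounded set, prefetched blocks arrive instantly (so the available bit is not modelled), and the names <code>StreamBuffer</code> and <code>simulate</code> are hypothetical.

```python
from collections import deque


class StreamBuffer:
    """A FIFO of prefetched block addresses; only the head is compared."""

    def __init__(self, depth=4):
        self.depth = depth
        self.fifo = deque()
        self.next_addr = None

    def restart(self, miss_addr):
        """Flush the buffer and prefetch the blocks that follow miss_addr."""
        self.fifo = deque(miss_addr + i for i in range(1, self.depth + 1))
        self.next_addr = miss_addr + self.depth + 1

    def pop_head_if_match(self, addr):
        """Return True and advance the FIFO if addr is at the head."""
        if self.fifo and self.fifo[0] == addr:
            self.fifo.popleft()
            self.fifo.append(self.next_addr)  # refill the tail
            self.next_addr += 1
            return True
        return False


def simulate(trace, depth=4):
    """Return (cache_hits, buffer_hits, misses) for a block-address trace."""
    cache, buf = set(), StreamBuffer(depth)
    cache_hits = buffer_hits = misses = 0
    for addr in trace:
        if addr in cache:
            cache_hits += 1
        elif buf.pop_head_if_match(addr):
            buffer_hits += 1
            cache.add(addr)        # block moves from the buffer to the cache
        else:
            misses += 1
            cache.add(addr)        # demand fetch from memory
            buf.restart(addr)      # start a new stream after the miss
    return cache_hits, buffer_hits, misses


print(simulate(list(range(10))))   # sequential stream
print(simulate([0, 2]))            # block 2 is in the buffer, but not at its head
```

The second trace shows the cost of comparing only the head entry: block 2 has already been prefetched, yet the access still counts as a miss because block 1 sits in front of it.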

==Prefetching in Parallel Computing==
On a uniprocessor system, prefetching is definitely helpful to improve performance. On a multiprocessor system prefetching is useful, but there are tighter constraints on its implementation because data will be shared between different processors or cores. In the message passing parallel model, each parallel thread has its own memory space and prefetching can be implemented in the same way as for a uniprocessor system.
In shared memory parallel programming, multiple threads that run on different processors share a common memory space. If multiple processors use a common cache, then the prefetching implementation is similar to that of a uniprocessor system. Difficulties arise when each core has its own cache. Some of the scenarios that can occur are:
# Processor P1 has prefetched some data D1 into its stream buffer but has not used it yet. At the same time, processor P2 reads data D1 into its cache and modifies it, and a cache coherency protocol is used to inform processor P1 about this change. D1 is not in P1’s cache, so P1 may simply ignore the notification. Now, when P1 tries to read D1, it will get the stale data from its stream buffer. One way to prevent this is by improving stream buffers so that they can update their data just like a cache. This adds complexity to the architecture and increases cost[[#6foot|[6]]].
# The prefetching mechanism can fetch data D1, D2, D3, ..., D10 into P1’s buffer. Now, due to parallel processing, P1 only needs to operate on D1 to D5 while P2 will operate on the remaining data. Some bandwidth was wasted in fetching D6 to D10 into P1 even though P1 did not use them. There is a trade-off to be made here: if prefetching is very conservative it will lead to misses, and if it is not it will waste bandwidth.
# In a multiprocessor system, if threads are not bound to a core, the operating system can rebalance the threads across different cores. This will require the prefetched buffers to be discarded.
 
Following are two examples of processors that support prefetching directly in the chip design:

==Prefetching in Intel Core i7[[#11foot|[11]]]==
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.

The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently. The processor assumes that this ascending access will continue, and prefetches the next line.

The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory: one for upstreams that has 12 entries, and one for downstreams that has 4 entries. As pages are accessed, their addresses are tracked in these arrays. When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.

==Prefetching in AMD[[#12foot|[12]]]==

The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache. (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)

The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though. When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory. For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected. Subsequent detection of misses of sequential blocks may only prefetch a single block. This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.

AMD contends that this is beneficial when processing large data sets which often process sequential data and process all the data in the stream. In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU. AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing.

The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream. In this case, the unit-stride is increased to ensure that the prefetcher is fetching at a rate sufficient to continuously provide data to the processor.

=Cache Coherence Support=
<ref>[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]</ref>Cache coherence (also cache coherency) refers to the consistency of data stored in local caches of a shared resource. In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.

Cache coherence is achieved if the following conditions are met.
* If a processor P1 writes a value A to a memory location X and reads from the same memory location X, then P1 should be returned value A provided no other write happened in-between.
* If a processor P1 writes a value A to a memory location X and another processor P2 reads from the same memory location X, then P2 should be returned the value A written by P1 provided no other write happened in-between.
* Consider that a processor P1 writes a value A to a memory location X followed by another processor P2 which writes a value B to the same memory location X. In this case the writes should appear in the same order i.e. A and then B.

==Software vs. Hardware solutions==

Both software and hardware solutions exist for cache coherence. Hardware solutions are more widely used than software solutions. Software schemes require support from the compiler or runtime system, and in some cases also require hardware assistance.

<ref>[ftp://ftp.cs.wisc.edu/markhill/Papers/isca91_coherence.pdf Comparison of Hardware and Software Cache Coherence Schemes] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon</ref>Recent studies have shown that software schemes are comparable to the hardware solutions. The only cases for which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts or when there are many writes in the program that are not executed at run-time. For relatively well structured and deterministic programs, on the other hand, software schemes perform in the same range as the hardware schemes.

==Cache Coherence Schemes – Fetch and Replacements==

===Invalidation Schemes vs. Update Strategies<ref>[https://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf Cache Coherence] Josep Torrellas</ref>===

* Invalidation: On a write, all other caches with a copy are invalidated. Invalidation is least recommended when there is a single producer and many consumers of the data.
* Update: On a write, all other caches with a copy are updated. The update strategy is least recommended when one processing element, say PE1, writes multiple times before the data is read by another processing element, PE2.
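The two recommendations above can be illustrated by counting coherence traffic for a single block under each strategy. The Python sketch below is a deliberately crude model with assumed costs: one message per read miss, one message per remote copy that must be invalidated or updated on a write, and no cost for the writer obtaining the block. The function name <code>coherence_traffic</code> and the trace format are hypothetical.

```python
def coherence_traffic(ops, protocol):
    """Count messages for one block; ops is a list of ("r" | "w", pe_id).

    `protocol` is "invalidate" or "update". Costs are assumptions:
    1 message per read miss, 1 message per remote copy on each write.
    """
    valid = set()      # processing elements that hold a valid copy
    messages = 0
    for op, pe in ops:
        if op == "r":
            if pe not in valid:
                messages += 1          # read miss: fetch the block
                valid.add(pe)
        else:
            messages += len(valid - {pe})  # notify every other copy
            if protocol == "invalidate":
                valid = {pe}           # other copies are dropped
            else:
                valid.add(pe)          # other copies stay valid and current
    return messages


# One producer (PE0), two consumers that read after every write.
producer_consumer = [("w", 0), ("r", 1), ("r", 2)] * 3
# PE0 writes five times before PE1 reads again.
write_run = [("r", 1)] + [("w", 0)] * 5 + [("r", 1)]

for name, trace in [("producer/consumer", producer_consumer), ("write run", write_run)]:
    print(name, coherence_traffic(trace, "invalidate"), coherence_traffic(trace, "update"))
```

Under these assumed costs the producer/consumer trace generates less traffic with updates, while the run of consecutive writes generates less traffic with invalidation, matching the guidance in the list above.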

The strategies followed for fetch and replacement in some of the schemes are discussed below.

===Snoopy Cache Coherence Schemes===
#Snooping<ref>[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]</ref> is the process where the individual caches monitor address lines for accesses to memory locations that they have cached. When a write is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.
#Snoopy Scheme is a distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.
#When replacement of one of the entries is required the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries.

===Directory Based Cache Coherence Schemes===
#In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.
#Fetch and Replacement Scenarios<ref>[http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf Cache Coherence Techniques] Silvia Lametti</ref>:
##Block in Uncached State: The block required by a processor P1 is not in the cache. When a processor P1 tries to read an uncached block X, read miss occurs.
##*Read Miss: P1 is sent data from memory and P1 is made as the only sharing Node. The block X that is cached is marked as shared.
##*Write Miss: P1 is sent the data and also made as the sharing node. The block is now marked exclusive to indicate that the only valid copy is cached.
##Block is Shared: The block requested by a Processor P1 is in shared state.
##*Read Miss: Requesting Processor P1 is sent the data and P1 is added to the set of processors sharing the data.
##*Write Miss: Requesting Processor P1 is sent the value. All processors sharing the data are sent and invalidate message. The block is made exclusive for the Processor P1.
##Block is Exclusive: The current value of the block (say X) is held in the cache of the processor (the owner) P1 which currently owns the block exclusively.
##*Read Miss: When a processor P2 raises a fetch request for block X, P1 is notified that a fetch request has been posted for a block it currently holds. P1 sends the data back to the directory, where it is written to memory and sent on to the requesting processor P2. P2 is then added to the set of processors sharing block X along with P1.
##*Data Write-back: The owner processor P1 is replacing the block and hence must write it back. This causes the copy of block X in memory to be up to date. The block is now uncached and the set of sharing processors is emptied.
##*Write Miss: A processor P2 makes a write request to block X, which is currently owned by P1. P1 is notified of this, which causes P1 to update memory with its value of block X. The updated block X is then sent to P2 and made exclusive to P2.
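
The fetch and replacement scenarios above can be sketched as a small state machine kept by the directory for each block. This is a simplified illustration, not the scheme of any particular machine. The names <code>DirectoryEntry</code>, <code>read_miss</code>, <code>write_miss</code>, and <code>write_back</code>, and the message labels, are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Directory state for one memory block (hypothetical names)."""
    state: str = "UNCACHED"                  # UNCACHED, SHARED, or EXCLUSIVE
    sharers: set = field(default_factory=set)


def read_miss(entry, proc):
    """Handle a read miss from proc; return the messages the directory sends."""
    msgs = []
    if entry.state == "EXCLUSIVE":
        owner = next(iter(entry.sharers))
        msgs.append(("fetch", owner))        # owner writes the block back to memory
    msgs.append(("data", proc))
    entry.sharers.add(proc)                  # requester joins the sharing set
    entry.state = "SHARED"
    return msgs


def write_miss(entry, proc):
    """Handle a write miss from proc; the block ends up exclusive to proc."""
    msgs = []
    if entry.state == "SHARED":
        msgs += [("invalidate", p) for p in sorted(entry.sharers - {proc})]
    elif entry.state == "EXCLUSIVE":
        owner = next(iter(entry.sharers))
        msgs.append(("fetch_invalidate", owner))
    msgs.append(("data", proc))
    entry.sharers = {proc}
    entry.state = "EXCLUSIVE"
    return msgs


def write_back(entry, proc):
    """The owner replaces the block: memory is up to date, nobody caches it."""
    entry.sharers.clear()
    entry.state = "UNCACHED"
```

For example, a read miss by P1 on an uncached block followed by a write miss by P2 produces one invalidate message to P1 and leaves the block exclusive to P2.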

===Write Through Schemes===
#In write-through schemes, every processor write updates the local cache and issues a global bus write that updates main memory and invalidates or updates all other caches holding the same item.

===Write-Back/Ownership Schemes===
#In write-back/ownership schemes, a single cache has ownership of a block. When the owning processor writes to the block, no bus write is generated, which conserves bandwidth.

===Pointer-Based Coherence Schemes<ref>[http://courses.engr.illinois.edu/cs533/reading_list/1c.pdf Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry</ref>===
#The Full Bit Vector Schemes
#*This scheme associates a complete bit vector, one bit per processor, with each block of main memory. The directory also contains a dirty-bit for each memory block to indicate if some processor has been given exclusive access to modify that block in its cache. Each bit indicates whether that memory block is being cached by the corresponding processor, and thus the directory has full knowledge of the processors caching a given block.
#*When a block has to be invalidated, messages are sent to all processors whose caches have a copy. In terms of message traffic needed to keep the caches coherent, this is the best that an invalidation-based directory scheme can do.
#Limited Pointer Schemes
#*Recent studies have shown that for most kinds of data objects the corresponding memory locations are cached by only a small number of processors at any given time. This knowledge can be exploited to reduce directory memory overhead by restricting each directory entry to a small fixed number of pointers, each pointing to a processor caching that memory block.
#*An important implication of limited pointer schemes is that there must exist some mechanism to handle blocks that are cached by more processors than the number of pointers in the directory entry.
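
The storage trade-off between the two schemes can be made concrete with a rough calculation of directory bits per memory block. This is a sketch under stated assumptions: each limited-pointer entry is assumed to hold a full processor ID, and one dirty bit is assumed for both schemes. The function names are hypothetical.

```python
def full_bit_vector_bits(num_procs):
    """One presence bit per processor plus one dirty bit."""
    return num_procs + 1


def limited_pointer_bits(num_procs, num_pointers):
    """num_pointers processor IDs plus one dirty bit (assumed layout)."""
    pointer_width = max(1, (num_procs - 1).bit_length())   # bits to name one processor
    return num_pointers * pointer_width + 1


def overflows(sharers, num_pointers):
    """True when more processors cache the block than the entry has pointers."""
    return len(sharers) > num_pointers
```

With 256 processors, a full bit vector needs 257 bits per block, while four pointers need 33 bits. The price is that a fifth sharer triggers the overflow case mentioned above.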

==Cache Coherency protocol==
A '''coherency protocol''' is a protocol which maintains the consistency between all the caches in a system of [[distributed shared memory]]. The protocol maintains [[memory coherence]] according to a specific [[consistency model]]. Older multiprocessors support the [[sequential consistency]] model, while modern shared memory systems typically support the [[release consistency]] or [[weak consistency]] models.

Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.

Various models and protocols have been devised for maintaining coherence, such as [[MSI_protocol|MSI]], [[MESI_protocol|MESI]] (aka Illinois), [[MOSI_protocol|MOSI]], [[MOESI_protocol|MOESI]], [[MERSI_protocol|MERSI]], [[MESIF_protocol|MESIF]], [[Write-once (cache coherence)|write-once]], [[Synapse protocol|Synapse]], [[Berkeley protocol|Berkeley]], [[Firefly protocol|Firefly]], and [[Dragon protocol]].

=References=
[[#1body|1.]] http://www.real-knowledge.com/memory.htm 
[[#2body|2.]] Computer Design & Technology- Lectures slides by Prof.Eric Rotenberg 
[[#3body|3.]] Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin 
[[#4body|4.]] “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201. 
[[#5body|5.]] Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer 
[[#6body|6.]] “Parallel computer architecture: a hardware/software approach” by David. E. Culler, Jaswinder Pal Singh, Anoop Gupta (pg 887) 
[[#7body|7.]] “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler. (ref#2) 
[[#8body|8.]] http://en.wikipedia.org/wiki/Instruction_prefetch 
[[#9body|9.]] http://www.real-knowledge.com/memory.htm 
[[#10body|10.]] Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151 
[[#11body|11.]] "Intel® 64 and IA-32 Architectures Optimization Reference Manual", Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf 
[[#12body|12.]] "Software Optimization Guide for AMD Family 10h and 12h Processors", Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf 

=Other References=
<references/>

CSC/ECE 506 Spring 2013/6a cs

2013-02-24T07:50:01Z

Scanjee: /* Pointer-Based Coherence Schemes */

=Cache Hierarchy=
[[Image: memchart.jpg|thumbnail|300px|right|Memory Hierarchy[[#9foot|[9]]]]]
In a simple computer model, the processor reads data and instructions from memory and operates on the data. CPU operating frequencies have increased faster than the speed of memory and memory interconnects. For example, cores in Intel's first-generation i7 processors run at 3.2 GHz, while the memory runs at only 1.3 GHz. Multi-core architectures have also placed more demand on memory bandwidth. Together these increase the latency of memory accesses, so the CPU can sit idle for much of the time while it waits for memory. As a result, memory became a performance bottleneck.

To solve this problem, the “cache” was invented. A cache is temporary volatile storage, like primary memory, but it runs at a speed close to the core frequency. The CPU can access data and instructions from cache in a few clock cycles, while accessing data from main memory can take more than 50 cycles. In the early days of computing, cache was implemented as a stand-alone chip outside the processor. In today’s processors, cache is implemented on the same die as the core.

There can be multiple levels of cache, each one farther from the core and larger in size. L1 is closest to the CPU and, as a result, fastest to access. Next to L1 is the L2 cache, and then L3. The L1 cache is divided into an instruction cache and a data cache. This is better than a larger combined cache because the instruction cache, being read-only, is easy to implement, while the data cache must support both reads and writes.

Cache management is impacted by three characteristics of modern processor architectures: multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.

Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds that are advancing at a more rapid pace. Two-level and three-level cache hierarchies are common. L1 typically ranges from 16KB to 64KB and provides access in 2-4 cycles. L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles. L3 typically ranges from 4MB to 32MB and provides access to the data it contains within 30-50 cycles.[[#10foot|[10]]]
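
To see how these latencies combine, the sketch below computes an average access time for a three-level hierarchy. The hit latencies are picked from the ranges quoted above. The miss rates and the 200-cycle memory latency are made-up illustrative values, and <code>average_access_time</code> is a hypothetical helper name.

```python
def average_access_time(levels, memory_latency):
    """levels: list of (hit_latency_cycles, miss_rate), ordered from L1 outward."""
    time = memory_latency
    # Work inward from memory: each level adds its own latency plus the
    # cost of going to the next level on a miss.
    for hit_latency, miss_rate in reversed(levels):
        time = hit_latency + miss_rate * time
    return time


# L1: 4 cycles, L2: 12 cycles, L3: 40 cycles (from the ranges above).
# Miss rates of 10%, 50% and 50% are assumptions for illustration only.
levels = [(4, 0.10), (12, 0.50), (40, 0.50)]
print(average_access_time(levels, 200))
```

Under these assumed miss rates the average access costs roughly 12 cycles, far closer to the L1 latency than to the memory latency. This is the effect the hierarchy is designed to achieve.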

Multi-core and multiprocessor designs place additional constraints on cache management by requiring it to consider all processes running concurrently in its cache coherency, replenishment, fetching, and other management functions.

In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors and others may be private to each processor. Typically, lower-level caches are private and high-level caches may or may not be shared. Notice that none of the examples below have a shared L1 cache. Cache management functions must consider both shared and private caches when reading and writing data from memory.
[[Image:Uniprocessor_memory_hierarchy.jpg|thumbnail|200px|Uniprocessor memory hierarchy[[#2foot|[2]]]]]
[[Image:Dualcore_memory_hierarchy.jpg|thumbnail|200px|Dualcore memory hierarchy[[#3foot|[3]]]]]

{| border='1' class="wikitable" style="text-align:center"
|+style="white-space:nowrap"|Table 1: Cache on different Microprocessors
|-
! Company & Processor !! # cores !! L1 cache per core !! L2 cache !! L3 cache !! Year
|-
| Intel Pentium Dual Core || 2 || I:32KB D:32KB || 1MB 8 way set assoc. || - || 2006
|-
| Intel Xeon Clovertown || 2 || I:4*32KB D:4*32KB || 2*4MB || - || Jan 2007
|-
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007
|-
| AMD Athlon 64FX || 2 || I:64KB D:64KB Both 2-way set assoc. || 1MB 16way set assoc. || - || Jan 2007
|-
| AMD Athlon 64X2 || 2 || I:64KB D:64KB Both 2-way set assoc. || 512KB/1MB 16way set assoc. || 2MB || Jan 2007
|-
| AMD Barcelona || 4 || I:64KB D:64KB || 512KB || 2MB Shared || Aug 2007
|-
| Sun Microsystems Ultra Sparc T2 || 8 || I:16KB D:8KB || 4MB (8 banks) 16way set assoc. || - || Oct 2007
|-
| Intel Xeon Wolfdale DP || 2 || D:96KB || 6MB || - || Nov 2007
|-
| Intel Xeon Harpertown || 4 || D:96KB || 2*6MB || - || Nov 2007
|-
| AMD Phenom || 4 || I:64KB D:64KB || 512KB || 2MB Shared || Nov 2007 Mar 2008
|-
| Intel Core 2 Duo || 2 || I:32KB D:32KB || 2/4MB 8 way set assoc. || - || 2008
|-
| Intel Penryn Wolfdale DP || 4 || - || 6-12MB || 6MB || Mar 2008 Aug 2008
|-
| Intel Core 2 Quad Yorkfield || 4 || D:96KB || 12MB || - || Mar 2008
|-
| AMD Toliman || 3K10 || I:64KB D:64KB || 512KB || 2MB Shared || Apr 2008
|-
| Azul Systems Vega3 7300 Series || 864 || 768GB || - || - || May 2008
|-
| IBM RoadRunner || 8+1 || 32KB || 512KB || - || Jun 2008
|}

=Cache Write Policies=

In section 6.2.3[[#10foot|[10]]], cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory. As a review, write-through writes data to both the cache and memory on every write. Write-back writes to the cache first and to memory only when a flush is required.
The write miss policies covered in the text[[#10foot|[10]]], write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line.
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.

==Write hit policies==
[[Image:Cache_policy_comparison.jpg|thumbnail|600px|Cache policy comparison[[#5foot|[5]]]]]
*Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.
**Advantages:
***Easy to implement
***Main memory has most recent copy of the data
***Read misses never result in writes to main memory
**Disadvantages:
***Every write needs to access main memory
***Bandwidth intensive
***Writes are slower
*Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.
**Advantages:
***Writes are as fast as the speed of the cache memory
***Multiple writes to a block require one write to main memory
***Less bandwidth intensive
**Disadvantages:
***Harder to implement
***Main memory may not be consistent with cache
***Reads that result in data replacement may cause dirty blocks to be written to main memory
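
The bandwidth difference between the two write hit policies can be illustrated by counting main-memory writes for a short access trace. This sketch assumes a deliberately tiny cache with a single line and a write-allocate miss policy. The function name <code>memory_writes</code> and the trace format are hypothetical.

```python
def memory_writes(trace, policy):
    """Count main-memory writes for a one-line cache.

    trace:  list of ("r" | "w", block) accesses.
    policy: "write-through" or "write-back".
    """
    cached, dirty, writes = None, False, 0
    for op, block in trace:
        if block != cached:
            # Miss: the current line is replaced; a dirty line is written back first.
            if policy == "write-back" and dirty:
                writes += 1
            cached, dirty = block, False
        if op == "w":
            if policy == "write-through":
                writes += 1                  # every write goes to main memory
            else:
                dirty = True                 # defer the write until eviction
    return writes


trace = [("w", "A"), ("w", "A"), ("w", "A"), ("r", "B")]
```

For this trace, write-through performs three memory writes. Write-back performs one, and that one is triggered by the read of block B that replaces the dirty block A.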

==Write miss policies==
Write miss policies can help eliminate write misses and therefore reduce bus traffic. This reduces the wait-time of all processors sharing the bus.

*Write-allocate vs no-write-allocate: When a write misses in the cache, a line in the cache may or may not be allocated to the block. With write-allocate, a line is allocated in the cache for the written data. This policy is typically associated with write-back caches. With no-write-allocate, no line is allocated in the cache.
*Fetch-on-write vs no-fetch-on-write: The fetch-on-write will cause the block of data to be fetched from a lower memory hierarchy if the write misses. The policy fetches a block on every write miss. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line. Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.
*Write-before-hit vs no-write-before-hit: The write-before-hit will write data to the cache before checking the cache tags for a match. In case of a miss, the policy will displace the block of data already in the cache.

==Combination Policies==
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram. They are considered unproductive since the data is fetched from lower level memory, but is not stored in the cache. Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low.

* Write-validate: It is a combination of no-fetch-on-write and write-allocate[[#4foot|[4]]]. The policy allows partial lines to be written to the cache on a miss. It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run. Write-validate requires that the lower level system memory can support writes of partial lines. The no-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the process is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate[[#4foot|[4]]]. This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.
* Write-around: Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit[[#4foot|[4]]]. This policy uses a non-blocking write scheme to write to cache. It writes data to the next lower cache without modifying the data of the cache line. It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that in all but a few cases write-around performs worse than write-validate. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from lower-level memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.

Of these three combinations, write-invalidate performed the worst, with the smallest reduction in write misses. The author notes, though, that it does perform better than fetch-on-write and is easy to implement.

* Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected. This policy was used as the base-line for determining the performance of the other three policy combinations. In general, all three previous combinations exhibited fewer misses than this policy.

=Prefetching=
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. A cache is very efficient in terms of access time once the data or instructions are in it. But when a process tries to access something that is not already in the cache, a cache miss occurs and the missing blocks need to be brought into the cache from memory. Cache misses are generally expensive because the processor has to wait for the data. (In parallel processing, a processor can execute other tasks while it is waiting on data, but there will be some overhead for this.) Prefetching brings data into the cache before the program needs it. In other words, it is a way to reduce cache misses. Prefetching uses some type of prediction mechanism to anticipate the next data that will be needed and brings it into the cache. It is not guaranteed that the prefetched data will be used. The goal is to reduce cache misses and so improve overall performance.
Some architectures have instructions to prefetch data into the cache. Programmers and compilers can insert these prefetch instructions in the code. This is known as software prefetching. In hardware prefetching, the processor observes the system behavior and issues prefetch requests itself. The Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch. Graphics Processing Units benefit from prefetching due to the spatial locality of their data[[#8foot|[8]]].

==Advantages==
*Improves overall performance by reducing cache misses.
==Disadvantages==
* Wastes bandwidth when prefetched data is not used.
* Hardware prefetching requires a more complex architecture. A second-order effect is the cost of implementing it in silicon and validating it.
* Software prefetching adds instructions to the program, making the program larger.
* If the same cache is used for prefetching, prefetching could cause other cache blocks to be evicted. If the evicted blocks are needed later, that will generate a cache miss. This can be prevented by having a separate cache for prefetched data, but that comes with hardware costs.
* When the scheduler changes the task running on a processor, prefetched data may become useless.

==Effectiveness==
Prefetching effectiveness can be tracked with the following metrics:
# Coverage is the fraction of original cache misses that were prefetched and therefore became cache hits.
# Accuracy is the fraction of prefetches that are useful.
# Timeliness measures how early the prefetches arrive.
Ideally, a system should have high coverage, high accuracy and optimum timeliness. Realistically, aggressive prefetching can increase coverage but decrease accuracy, and vice versa. Also, if prefetching is done too early, the fetched data may be evicted without being used, and if it is done too late, it can still cause a cache miss.
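
Coverage and accuracy as defined above reduce to two ratios. The sketch below computes them from three counters. The function name and the counter names are hypothetical, and the numbers in the usage line are invented purely for illustration.

```python
def prefetch_metrics(original_misses, prefetches_issued, useful_prefetches):
    """Return (coverage, accuracy) for a prefetcher.

    original_misses:   misses the program would take without prefetching
    prefetches_issued: total number of prefetch requests sent
    useful_prefetches: prefetched blocks that were used and turned a miss into a hit
    """
    coverage = useful_prefetches / original_misses
    accuracy = useful_prefetches / prefetches_issued
    return coverage, accuracy


# Invented example: 1000 baseline misses, 800 prefetches, 600 of them useful.
coverage, accuracy = prefetch_metrics(1000, 800, 600)
```

In this invented example, 60% of the original misses are covered and 75% of the prefetches are useful. Issuing more prefetches would tend to raise the first ratio and lower the second.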

==Stream Buffer Prefetching==
This is a prefetching technique that uses a FIFO buffer in which each entry is a cache line with an address (or tag) and an available bit. The system prefetches a stream of sequential data into a stream buffer, and multiple stream buffers can be used to prefetch multiple streams in parallel. On a cache access, the head entries of the stream buffers are checked for a match along with the cache. If the block is not found in the cache but is found at the head of a stream buffer, it is moved to the cache and the next entry in the buffer becomes the head. If the block is found neither in the cache nor at the head of a buffer, the data is brought from memory into the cache and the subsequent addresses are assigned to a new stream buffer. Only the heads of the stream buffers are checked during a cache access, not the whole buffer, because checking all the entries in all the buffers would increase hardware complexity.
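
The lookup sequence described above can be sketched as a toy simulation. It is a simplified model, not a description of real hardware. The cache is unbounded, blocks are plain integers, prefetched data arrives instantly, and the oldest buffer is reallocated when all buffers are in use. The class name <code>StreamBufferCache</code> and its parameters are assumptions made for this example.

```python
from collections import deque


class StreamBufferCache:
    """Toy model: an unbounded cache backed by a few sequential stream buffers."""

    def __init__(self, num_buffers=2, depth=4):
        self.cache = set()
        self.depth = depth
        # Assumption: when all buffers are in use, the oldest one is reallocated.
        self.buffers = deque(maxlen=num_buffers)

    def access(self, block):
        if block in self.cache:
            return "cache hit"
        for buf in self.buffers:
            if buf[0] == block:                   # only the head entry is compared
                buf.popleft()                     # next entry becomes the head
                buf.append(block + self.depth)    # prefetch one more sequential block
                self.cache.add(block)             # move the block into the cache
                return "stream buffer hit"
        # Miss everywhere: fetch the block and start a new stream after it.
        self.cache.add(block)
        self.buffers.append(deque(range(block + 1, block + 1 + self.depth)))
        return "miss"
```

Accessing blocks 100, 101, 102 in order gives one miss followed by two stream buffer hits. Jumping ahead to a block that sits in the middle of a buffer is a miss, because only the heads are compared.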
 
 
[[Image:Cache_hit_improvements.jpg|center]]
 
The plot above shows the cache hit improvements with respect to the number of stream buffers for different programs. The graph on the left compares 8 programs from the NAS suite, while the graph on the right shows programs from the Unix unity suite[[#7foot|[7]]].

==Prefetching in Parallel Computing==
On a uniprocessor system, prefetching is definitely helpful to improve performance. On a multiprocessor system prefetching is still useful, but there are tighter constraints on its implementation because data is shared between different processors or cores. In the message-passing parallel model, each parallel thread has its own memory space, and prefetching can be implemented in the same way as for a uniprocessor system.
In shared-memory parallel programming, multiple threads that run on different processors share a common memory space. If multiple processors use a common cache, then the prefetching implementation is similar to that of a uniprocessor system. Difficulties arise when each core has its own cache. Some of the scenarios that can occur are:
# Processor P1 has prefetched some data D1 into its stream buffer but has not used it yet. At the same time, processor P2 reads D1 into its cache and modifies it, and one of the cache coherency protocols is used to inform processor P1 about this change. D1 is not in P1’s cache, so P1 may simply ignore the notification. Now, when P1 tries to read D1, it will get the stale data from its stream buffer. One way to prevent this is by improving stream buffers so that they can modify their data just like a cache. This adds complexity to the architecture and increases cost[[#6foot|[6]]].
# The prefetching mechanism can fetch data D1, D2, D3, …, D10 into P1’s buffer. Due to parallel processing, P1 only needs to operate on D1 to D5 while P2 operates on the remaining data. Bandwidth was wasted fetching D6 to D10 into P1 even though P1 did not use them. There is a trade-off here: if prefetching is very conservative it will lead to misses, and if it is aggressive it will waste bandwidth.
# In a multiprocessor system, if threads are not bound to a core, the operating system can rebalance the threads across different cores. This requires the prefetched buffers to be discarded.
 
Following are two examples of processors that support prefetching directly in the chip design:

==Prefetching in Intel Core i7[[#11foot|[11]]]==
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.

The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently. The processor assumes that this ascending access will continue, and prefetches the next line.

The data prefetch logic (DPL) maintains two arrays to track recent accesses to memory: one for upstreams that has 12 entries, and one for downstreams that has 4 entries. As pages are accessed, their addresses are tracked in these arrays. When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.

==Prefetching in AMD[[#12foot|[12]]]==

The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache. (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)

The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though. When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory. For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected. Subsequent detection of misses of sequential blocks may only prefetch a single block. This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.

AMD contends that this is beneficial when processing large data sets which often process sequential data and process all the data in the stream. In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU. AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing.

The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers when the demand stream catches up to the prefetch stream. In this case, the unit-stride is increased to ensure that the prefetcher fetches at a rate sufficient to provide data to the processor continuously.

=Cache Coherence Support=
<ref>[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]</ref>Cache coherence (also cache coherency) refers to the consistency of data stored in local caches of a shared resource. In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.

Cache coherence is achieved if the following conditions are met.
* If a processor P1 writes a value A to a memory location X and reads from the same memory location X, then P1 should be returned value A provided no other write happened in-between.
* If a processor P1 writes a value A to a memory location X and another processor P2 reads from the same memory location X, then P2 should be returned the value A written by P1 provided no other write happened in-between.
* Consider that a processor P1 writes a value A to a memory location X followed by another processor P2 which writes a value B to the same memory location X. In this case the writes should appear in the same order i.e. A and then B.

==Software vs. Hardware solutions==

Both software and hardware solutions exist for cache coherence. Hardware solutions are more widely used than software solutions. Software schemes require support from the compiler or runtime system, and in some cases also require hardware assistance.

<ref>[ftp://ftp.cs.wisc.edu/markhill/Papers/isca91_coherence.pdf Comparison of Hardware and Software Cache Coherence Schemes] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon</ref>Recent studies have shown that software schemes are comparable to the hardware solutions. The only cases for which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts or when there are many writes in the program that are not executed at run-time. For relatively well structured and deterministic programs, on the other hand, software schemes perform significantly in the same range as the hardware schemes.

==Cache Coherence Schemes – Fetch and Replacements==

===Invalidation Schemes vs. Update Strategies<ref>[https://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf Cache Coherence] Josep Torrellas</ref>===

* Invalidation: On a write, all other caches with a copy are invalidated. Invalidation is least suitable when there is a single producer and many consumers of the data.
* Update: On a write, all other caches with a copy are updated. The update strategy is least suitable when one processing element, say PE1, performs multiple writes before the data is read by another processing element, PE2.
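
The two worst cases named above can be checked with a simple message count for a single shared block. This is a simplified model. Each read miss costs one message and each write costs one message per other cache holding a copy. The writer's own fetch is ignored. The function name <code>coherence_messages</code> and the trace format are hypothetical.

```python
def coherence_messages(trace, protocol):
    """Count bus messages for one shared block.

    trace:    list of ("r" | "w", processor) accesses.
    protocol: "invalidate" or "update".
    """
    holders = set()                      # caches that currently hold a valid copy
    messages = 0
    for op, proc in trace:
        if op == "r":
            if proc not in holders:      # read miss: fetch the block
                messages += 1
                holders.add(proc)
        else:
            others = holders - {proc}
            messages += len(others)      # one invalidate or update per other copy
            if protocol == "invalidate":
                holders = {proc}         # other copies are now invalid
            else:
                holders.add(proc)        # other copies stay valid and current
    return messages


# Single producer P0, three consumers, repeated three times.
producer_consumer = [("w", "P0"), ("r", "P1"), ("r", "P2"), ("r", "P3")] * 3
# P0 writes five times before P1 reads again.
repeated_writes = [("r", "P1")] + [("w", "P0")] * 5 + [("r", "P1")]
```

Under this model the producer-consumer trace costs 15 messages with invalidation and 9 with update, while the repeated-writes trace costs 3 with invalidation and 6 with update, matching the recommendations in the list above.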

The strategies followed for fetch and replacement in some of the schemes are discussed below.

===Snoopy Cache Coherence Schemes===
#Snooping<ref>[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]</ref> is the process where the individual caches monitor address lines for accesses to memory locations that they have cached. When a write is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.
#Snoopy Scheme is a distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.
#When replacement of one of the entries is required the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries.

===Directory Based Cache Coherence Schemes===
#In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.
#Fetch and Replacement Scenarios<ref>[http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf Cache Coherence Techniques] Silvia Lametti</ref>:
##Block in Uncached State: The block required by a processor P1 is not in the cache. When a processor P1 tries to read an uncached block X, a read miss occurs.
##*Read Miss: P1 is sent the data from memory and becomes the only sharing node. The cached block X is marked as shared.
##*Write Miss: P1 is sent the data and becomes the only sharing node. The block is marked exclusive to indicate that the only valid copy is the cached one.
##Block is Shared: The block requested by a Processor P1 is in shared state.
##*Read Miss: Requesting Processor P1 is sent the data and P1 is added to the set of processors sharing the data.
##*Write Miss: Requesting processor P1 is sent the value. All processors sharing the data are sent an invalidate message. The block is made exclusive to processor P1.
##Block is Exclusive: The current value of block X is held in the cache of processor P1 (the owner), which holds the block exclusively.
##*Read Miss: When a processor P2 issues a fetch request for block X, P1 is notified that a fetch request has been posted for a block it currently holds. P1 sends the data back to the directory, where it is written to memory and forwarded to the requesting processor P2. P2 is then added to the set of processors sharing block X, along with P1.
##*Data Write-back: The owner P1 is replacing the block and must therefore write it back. This brings the copy of block X in memory up to date. The block becomes uncached and the set of sharing processors is emptied.
##*Write Miss: A processor P2 issues a write request for block X, which is currently owned by P1. P1 is notified and updates memory with its value of block X. The updated block X is then sent to P2 and made exclusive to P2.
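
The fetch and replacement scenarios above can be sketched as a small state machine for a single directory entry. This is a minimal illustration rather than the implementation from the cited source: the names <code>DirectoryEntry</code>, <code>read_miss</code>, <code>write_miss</code> and <code>write_back</code>, and the message labels, are hypothetical. In particular, the <code>fetch_invalidate</code> message assumes that the previous owner drops its copy on a remote write miss, which the scenarios above imply but do not state.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    UNCACHED = "uncached"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


@dataclass
class DirectoryEntry:
    """Directory bookkeeping for one memory block (hypothetical layout)."""
    state: State = State.UNCACHED
    sharers: set = field(default_factory=set)


def read_miss(entry, requester):
    """Handle a read miss and return the messages the directory sends."""
    msgs = []
    if entry.state is State.EXCLUSIVE:
        owner = next(iter(entry.sharers))
        # The owner returns the block so that memory is brought up to date.
        msgs.append(("fetch", owner))
    msgs.append(("data", requester))
    entry.sharers.add(requester)
    entry.state = State.SHARED
    return msgs


def write_miss(entry, requester):
    """Handle a write miss and return the messages the directory sends."""
    msgs = []
    if entry.state is State.SHARED:
        # Every other sharer loses its copy.
        for proc in sorted(entry.sharers - {requester}):
            msgs.append(("invalidate", proc))
    elif entry.state is State.EXCLUSIVE:
        owner = next(iter(entry.sharers))
        # Assumption: the old owner writes the block back and drops its copy.
        msgs.append(("fetch_invalidate", owner))
    msgs.append(("data", requester))
    entry.sharers = {requester}
    entry.state = State.EXCLUSIVE
    return msgs


def write_back(entry, owner):
    """The owner replaces the block; memory is updated and the sharer set emptied."""
    if entry.state is not State.EXCLUSIVE or entry.sharers != {owner}:
        raise ValueError("only the exclusive owner can write the block back")
    entry.sharers = set()
    entry.state = State.UNCACHED
    return [("memory_update", owner)]
```

For example, starting from an uncached block, a read miss by P1 followed by a read miss by P2 leaves the entry shared by both, and a subsequent write miss by P3 sends an invalidate message to P1 and P2 before the block becomes exclusive to P3.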

===Write Through Schemes===
#In write-through schemes, every processor write updates the local cache and issues a global bus write that updates main memory and also invalidates or updates all other caches holding the same item.

===Write-Back/Ownership Schemes===
#In write-back/ownership schemes, a single cache has ownership of a block. When the owning processor writes to the block, the write does not generate a bus write, which conserves bandwidth.
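
The bandwidth difference between the two schemes can be illustrated with a deliberately simplified model. The function <code>count_bus_transactions</code> and the trace format are hypothetical. The model assumes that acquiring ownership costs exactly one bus transaction and it ignores the write-back that occurs when an owned block is eventually evicted.

```python
def count_bus_transactions(writes, policy):
    """Count bus transactions caused by a sequence of (processor, block) writes.

    policy is "write_through" or "write_back" (simplified, hypothetical model).
    """
    if policy not in ("write_through", "write_back"):
        raise ValueError(f"unknown policy: {policy}")
    owner = {}          # block -> processor that currently owns it
    transactions = 0
    for proc, block in writes:
        if policy == "write_through":
            # Every write is propagated over the bus to main memory.
            transactions += 1
        elif owner.get(block) != proc:
            # Write-back: one transaction to acquire ownership of the block.
            transactions += 1
            owner[block] = proc
        # Write-back: further writes by the current owner stay in its cache.
    return transactions


# P1 writes block X four times, then P2 writes it twice.
trace = [("P1", "X")] * 4 + [("P2", "X")] * 2
through = count_bus_transactions(trace, "write_through")
back = count_bus_transactions(trace, "write_back")
```

Under these assumptions the trace costs six bus transactions with write-through but only two with write-back, one for each change of ownership.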

===Pointer-Based Coherence Schemes<ref>[http://courses.engr.illinois.edu/cs533/reading_list/1c.pdf Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry</ref>===
#The Full Bit Vector Schemes
#*This scheme associates a complete bit vector, one bit per processor, with each block of main memory. Each bit indicates whether that memory block is being cached by the corresponding processor, so the directory has full knowledge of the processors caching a given block. The directory also contains a dirty bit for each memory block to indicate whether some processor has been given exclusive access to modify that block in its cache.
#*When a block has to be invalidated, messages are sent to all processors whose caches have a copy. In terms of message traffic needed to keep the caches coherent, this is the best that an invalidation-based directory scheme can do.
#Limited Pointer Schemes
#*Recent studies have shown that, for most kinds of data objects, the corresponding memory locations are cached by only a small number of processors at any given time. This knowledge can be exploited to reduce directory memory overhead by restricting each directory entry to a small fixed number of pointers, each pointing to a processor caching that memory block.
#*An important implication of limited pointer schemes is that there must exist some mechanism to handle blocks that are cached by more processors than the number of pointers in the directory entry.
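
The storage trade-off between the two schemes, and one possible way of handling pointer overflow, can be sketched as follows. The function and class names are hypothetical. Falling back to a broadcast invalidation on overflow is only one candidate mechanism and is used here as an assumption; the bit counts include the dirty bit but not the extra overflow flag that the broadcast fallback would need.

```python
def full_bit_vector_bits(num_procs):
    """Directory bits per memory block: one presence bit per processor plus a dirty bit."""
    return num_procs + 1


def limited_pointer_bits(num_procs, num_pointers):
    """Directory bits per memory block: fixed-width pointers plus a dirty bit."""
    pointer_width = (num_procs - 1).bit_length()  # bits needed to name one processor
    return num_pointers * pointer_width + 1


class LimitedPointerEntry:
    """Directory entry with a fixed number of sharer pointers (hypothetical sketch)."""

    def __init__(self, num_pointers):
        self.num_pointers = num_pointers
        self.pointers = []
        self.overflow = False  # set once more sharers exist than pointers

    def add_sharer(self, proc):
        if self.overflow or proc in self.pointers:
            return
        if len(self.pointers) < self.num_pointers:
            self.pointers.append(proc)
        else:
            # Assumed fallback: stop tracking individual sharers and broadcast later.
            self.overflow = True

    def invalidation_targets(self, all_procs, writer):
        """Processors that must receive an invalidate message when writer writes."""
        if self.overflow:
            return [p for p in all_procs if p != writer]
        return [p for p in self.pointers if p != writer]
```

With 256 processors, the full bit vector needs 257 bits per block, while four 8-bit pointers need 33 bits. The cost of the saving shows up on overflow: once a fifth processor caches the block, this sketch can no longer name the sharers and must send invalidations to every processor except the writer.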

==Cache Coherency protocol==
A '''coherency protocol''' is a protocol which maintains the consistency between all the caches in a system of [[distributed shared memory]]. The protocol maintains [[memory coherence]] according to a specific [[consistency model]]. Older multiprocessors support the [[sequential consistency]] model, while modern shared memory systems typically support the [[release consistency]] or [[weak consistency]] models.

Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.
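
As a rough illustration of how the choice between update and invalidation affects traffic, the sketch below counts messages for accesses to a single block. The function <code>coherence_messages</code> and its cost model are assumptions made for this example: each miss costs one message, and each write costs one message per other cache holding a valid copy.

```python
def coherence_messages(trace, policy):
    """Count messages for a trace of (processor, op) accesses to one block.

    op is "r" or "w"; policy is "invalidate" or "update" (applied on writes).
    """
    if policy not in ("invalidate", "update"):
        raise ValueError(f"unknown policy: {policy}")
    valid = set()   # processors currently holding a valid copy
    messages = 0
    for proc, op in trace:
        if proc not in valid:
            messages += 1            # miss: the block must be fetched
            valid.add(proc)
        if op == "w":
            others = valid - {proc}
            messages += len(others)  # one invalidate or update per other copy
            if policy == "invalidate":
                valid = {proc}       # the other copies are no longer valid
    return messages


setup = [("P0", "r"), ("P1", "r")]
# One processor writes repeatedly before the other reads again.
burst = setup + [("P0", "w")] * 4 + [("P1", "r")]
# The processors alternate: each write is followed by a read elsewhere.
ping_pong = setup + [("P0", "w"), ("P1", "r")] * 3
```

In this model the write burst costs 4 messages with invalidation and 6 with update, while the alternating pattern costs 8 with invalidation and 5 with update, so neither choice is better for every sharing pattern.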

Various models and protocols have been devised for maintaining coherence, such as [[MSI_protocol|MSI]], [[MESI_protocol|MESI]] (aka Illinois), [[MOSI_protocol|MOSI]], [[MOESI_protocol|MOESI]], [[MERSI_protocol|MERSI]], [[MESIF_protocol|MESIF]], [[Write-once (cache coherence)|write-once]], [[Synapse protocol|Synapse]], [[Berkeley protocol|Berkeley]], [[Firefly protocol|Firefly]], and [[Dragon protocol|Dragon]].

=References=
[[#1body|1.]] http://www.real-knowledge.com/memory.htm 
[[#2body|2.]] Computer Design & Technology, Lecture slides by Prof. Eric Rotenberg 
[[#3body|3.]] Fundamentals of Parallel Computer Architecture by Prof. Yan Solihin 
[[#4body|4.]] “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201. 
[[#5body|5.]] Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer 
[[#6body|6.]] “Parallel computer architecture: a hardware/software approach” by David E. Culler, Jaswinder Pal Singh, Anoop Gupta (pg 887) 
[[#7body|7.]] “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler. (ref#2) 
[[#8body|8.]] http://en.wikipedia.org/wiki/Instruction_prefetch 
[[#9body|9.]] http://www.real-knowledge.com/memory.htm 
[[#10body|10.]] Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151 
[[#11body|11.]] "Intel® 64 and IA-32 Architectures Optimization Reference Manual", Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf 
[[#12body|12.]] "Software Optimization Guide for AMD Family 10h and 12h Processors", Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf 

=Other References=
<references/>