<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Scanjee</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Scanjee"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Scanjee"/>
	<updated>2026-06-23T20:48:42Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74653</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74653"/>
		<updated>2013-04-03T18:50:12Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Prefetching Improvements */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]&lt;br /&gt;
&lt;br /&gt;
'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
Previous articles&lt;br /&gt;
&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by prefetching techniques, such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line may already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
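&lt;br /&gt;
The timing trade-off above is often expressed as a prefetch distance: how many loop iterations ahead of its use a cache line should be requested. The following is a minimal sketch in Python, assuming a constant miss latency and a constant cost per loop iteration. The function name and the numbers are illustrative assumptions, not taken from the cited manual.&lt;br /&gt;
&lt;br /&gt;
```python
import math

def prefetch_distance(miss_latency_cycles, cycles_per_iteration):
    """Iterations ahead to prefetch so the line arrives just in time.

    Assumes a constant miss latency and a constant loop-body cost
    (both are simplifications made for this sketch).
    """
    if cycles_per_iteration == 0:
        raise ValueError("loop body must take at least one cycle")
    return math.ceil(miss_latency_cycles / cycles_per_iteration)

# Illustrative numbers only: 100-cycle miss, 25-cycle loop body.
distance = prefetch_distance(100, 25)
print(distance)
```
&lt;br /&gt;
A smaller distance than this risks the stall described above; a much larger one risks eviction before use.&lt;br /&gt;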
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
1. A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines is accessed by the program. When such a stream is found, the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
3. An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
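&lt;br /&gt;
As an illustration of the stride variant, the sketch below keeps one table entry per load instruction and predicts the next address once the same stride has been seen twice in a row. It is a toy model in Python under stated assumptions: the class name, the unbounded table, and the two-in-a-row rule are choices made for this example and do not describe any particular processor.&lt;br /&gt;
&lt;br /&gt;
```python
class StridePrefetcher:
    """Toy per-instruction stride detector (a sketch, not a real design)."""

    def __init__(self):
        # pc maps to (last address, last stride)
        self.table = {}

    def access(self, pc, address):
        """Record one access; return an address to prefetch, or None."""
        last = self.table.get(pc)
        if last is None:
            self.table[pc] = (address, 0)
            return None
        last_address, last_stride = last
        stride = address - last_address
        self.table[pc] = (address, stride)
        # Predict only when the same non-zero stride is seen twice in a row.
        if stride != 0 and stride == last_stride:
            return address + stride
        return None

prefetcher = StridePrefetcher()
# One load instruction (pc 0x40) walking an array in 256-byte steps.
predictions = [prefetcher.access(0x40, a) for a in (0, 256, 512, 768)]
print(predictions)
```
&lt;br /&gt;
The first two accesses only train the table; predictions start with the third access.&lt;br /&gt;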
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
* Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
* Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The characteristics of hardware prefetching are as follows:&lt;br /&gt;
&lt;br /&gt;
* Works with existing applications&lt;br /&gt;
&lt;br /&gt;
* Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
* Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Based on previous memory accesses, hardware-based techniques can dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
1).''always prefetch'' - prefetch the next block on each reference &lt;br /&gt;
&lt;br /&gt;
2).''prefetch on miss'' - prefetch the next block only on a miss &lt;br /&gt;
&lt;br /&gt;
3).''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
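&lt;br /&gt;
The three OBL variants above can be compared with a small simulation. This is a sketch in Python that assumes an unbounded cache and prefetches that complete instantly; the function name and policy labels are hypothetical and exist only for this example.&lt;br /&gt;
&lt;br /&gt;
```python
def simulate_obl(blocks, policy):
    """Count demand misses for one OBL policy on a block reference stream.

    Assumes an unbounded cache and prefetches that complete instantly.
    """
    cache = set()
    tagged = set()   # prefetched blocks that have not been referenced yet
    misses = 0
    for block in blocks:
        miss = block not in cache
        first_use = miss or block in tagged
        if miss:
            misses += 1
            cache.add(block)
        tagged.discard(block)
        if (policy == "always"
                or (policy == "on_miss" and miss)
                or (policy == "tagged" and first_use)):
            if block + 1 not in cache:
                cache.add(block + 1)
                tagged.add(block + 1)
    return misses

stream = list(range(8))   # purely sequential references
results = {p: simulate_obl(stream, p) for p in ("always", "on_miss", "tagged")}
print(results)
```
&lt;br /&gt;
On this purely sequential stream, prefetch on miss still misses on every other block, while always prefetch and tagged prefetch miss only on the first block.&lt;br /&gt;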
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns. &lt;br /&gt;
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching Improvements'''==&lt;br /&gt;
&lt;br /&gt;
Fortunately a number of techniques have evolved to address many of the concerns about prefetching, thus making it significantly more effective.&lt;br /&gt;
&lt;br /&gt;
==='''Cache Sizing'''===&lt;br /&gt;
&lt;br /&gt;
Increasing the dimensions of the cache greatly improves the performance of prefetching.  This can be done either through increasing the cache size itself or increasing set associativity.  A small cache size is a significant problem with prefetching because conflict misses increase dramatically.  Any benefit you might derive from having fewer cache lookup misses is negated by this higher incidence of conflict misses.  The following study assembled some benchmarks for a given SPARC (see [[#Definitions]]) processor that utilized prefetching.  The following graph shows various levels of cache misses based on various cache sizes.  With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, while the number of misses without prefetching is rather high.&lt;br /&gt;
&lt;br /&gt;
For example, the following study of a BioInformatics application showed the effects of prefetching cache misses based on the size of the L1 cache.  As can be seen, cache size has a major impact on the miss rate and, subsequently, performance in the system.&amp;lt;ref&amp;gt;http://www.nikolaylaptev.com/master/classes/cs254.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Cachesize.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==='''Improved Prefetch Timing'''===&lt;br /&gt;
&lt;br /&gt;
Improving prefetch timing algorithms can also significantly improve performance and prevent data from being prefetched too late or too early to be useful.  One possibility is a technique called Aggressive Prefetching&amp;lt;ref&amp;gt;http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html&amp;lt;/ref&amp;gt;.  Traditionally, prefetch algorithms have fetched only incremental amounts of data that they can predict with a high level of confidence.  The more aggressive approach, however, searches more deeply in the reference stream to bring in a wider array of data that may be used in the near future.  This does, however, require a more resource-intensive machine to avoid negatively impacting performance.&lt;br /&gt;
&lt;br /&gt;
Greater amounts of processing power and storage have made this more aggressive approach possible.  While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now.  Varying performance from a variety of devices also changes the need to be conservative in prefetching.  Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.&lt;br /&gt;
&lt;br /&gt;
==='''Memory Pattern Recognition'''===&lt;br /&gt;
&lt;br /&gt;
Effective memory pattern recognition algorithms can go a long way towards resolving issues with prefetching values unnecessarily (i.e., ones that might not be used in a sufficient timeframe).  One example is the stride technique.  For example, memory addresses that are sequential in nature, like an array, would all be brought in together as a unit since more than likely most or all of these elements will be utilized.  This is an improvement over a more conservative prefetcher that may just assume memory located together will be used.  The more conservative version has been known to have a much higher incidence of bringing in unnecessary blocks of data.&lt;br /&gt;
&lt;br /&gt;
Another variation of this is the linked memory reference pattern.  Linked lists(See [[#Definitions]]) and tree structures are most often associated with this type.  Generally, occurrences of code similar to ptr = ptr-&amp;gt;next are considered as one structure and subsequently brought into memory together.&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, instead of:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
the linked memory reference pattern may work in the following way:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   prefetch(next(p))&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Here, the node that follows p is requested while work(p) is still running, so it is more likely to already be close to the processor when the loop advances to it.  This is more efficient than the initial loop.&lt;br /&gt;
&lt;br /&gt;
==='''Markov Prefetcher'''===&lt;br /&gt;
&lt;br /&gt;
This particular prefetch technique&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt; is built around the idea of remembering past sequences of cache misses and, utilizing this memory, bringing in subsequent misses in the sequence.  Generally a rather large table would be used containing previous miss addresses.  This table is maintained in a similar manner to a cache.  Sequences stored in this table are then used to help predict future sequences that may need to be prefetched.&lt;br /&gt;
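&lt;br /&gt;
The idea of remembering which miss tends to follow which can be sketched as a table from a miss address to counts of the misses that followed it. The Python below is a toy model, assuming an unbounded table and a single predicted successor per miss; the class and method names are hypothetical, and a real design would bound the table and manage it like a cache, as described above.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter, defaultdict

class MarkovPrefetcher:
    """Toy Markov predictor over miss addresses (unbounded table)."""

    def __init__(self):
        # miss address maps to counts of the miss that came next
        self.successors = defaultdict(Counter)
        self.previous = None

    def on_miss(self, address):
        """Record a miss; return the most frequent successor, or None."""
        if self.previous is not None:
            self.successors[self.previous][address] += 1
        self.previous = address
        followers = self.successors.get(address)
        if not followers:
            return None
        return followers.most_common(1)[0][0]

markov = MarkovPrefetcher()
# Hypothetical miss stream: the sequence A, B, C repeats.
predictions = [markov.on_miss(a) for a in ["A", "B", "C", "A", "B", "C"]]
print(predictions)
```
&lt;br /&gt;
The first pass through the sequence only fills the table; on the second pass each miss predicts the address that followed it before.&lt;br /&gt;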
&lt;br /&gt;
==='''Stealth Prefetching'''===&lt;br /&gt;
&lt;br /&gt;
A more modern and innovative technique, stealth prefetching&amp;lt;ref&amp;gt;http://arnetminer.org/publication/stealth-prefetching-53889.html&amp;lt;/ref&amp;gt;, attempts to deal with issues surrounding the increase of interconnection bandwidth and increased memory latency.  It also helps avoid the problem of using bandwidth to prematurely access shared data that will later result in state downgrades.  Basically, this technique tracks regions of memory at a coarse granularity to identify which segments are heavily used by multiple processors and which ones are not.  Lines that are not being shared by multiple processors are more aggressively brought nearer to the local processor, thus improving performance by as much as 20%.  With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.&lt;br /&gt;
&lt;br /&gt;
=='''Memory Consistency models'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
A basic requirement for the implementation of a shared-memory multiprocessor system is to correctly order the reads and writes to any memory location as seen by any processor attempting to access it. This requirement is called the memory consistency requirement. The memory consistency models adhere to this requirement. The models are implemented in a system at the machine code interface as well as the high-level language interface. They are applied while translating high-level code to the machine-level code which is directly executed. The consistency models therefore make memory operations predictable.&lt;br /&gt;
&lt;br /&gt;
==='''An overview of memory consistency models'''===&lt;br /&gt;
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:&lt;br /&gt;
&lt;br /&gt;
“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”&lt;br /&gt;
&lt;br /&gt;
[[Image:2.1.jpg|thumb|right|300px| Sequential Consistency Example]]&lt;br /&gt;
&lt;br /&gt;
This model restricts the reordering of any read or store instructions and thus restricts compiler optimization techniques.  A memory consistency model which allows the reordering of operations is called a relaxed consistency model. Because reads and writes can be reordered, relaxed models deviate from the programmer’s expectation. Some of the relaxed consistency models include processor consistency, weak ordering, and release consistency models.&lt;br /&gt;
&lt;br /&gt;
''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all earlier stores complete before a load takes place, i.e., a younger load can complete before an older store that precedes it has completed.&lt;br /&gt;
&lt;br /&gt;
''Weak ordering (WO)'' – Weak ordering groups read and write instructions into critical sections.  Reordering the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. These critical sections are separated by synchronization events, which can be implemented in the form of locks, barriers, or post-wait pairs. The requirements for the weak ordering model to work are therefore:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly. &lt;br /&gt;
* All read and write operations before a synchronization event have been completed. &lt;br /&gt;
* All loads and stores following a critical section cannot precede the section.&lt;br /&gt;
&lt;br /&gt;
''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, while a thread leaves the critical section after a release operation. Hence an acquire operation is not concerned with older reads and writes, as that requirement is taken care of by the release operation, and similarly a release operation is not concerned with younger reads and writes, as those are handled by the acquire operation. Hence we can rewrite the requirements for weak ordering to follow the release consistency model as follows:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.&lt;br /&gt;
* All reads and writes before a release operation should be completed.&lt;br /&gt;
* All acquire operations related to a critical section should be completed before handling a younger read or write.&lt;br /&gt;
* The acquire and release operations should be atomic with respect to each other. &lt;br /&gt;
A core difference between weak ordering and release consistency is that in release consistency overlapping of critical sections is possible.&lt;br /&gt;
Figure 3.1 shows how a sequence of instructions will be executed following a particular consistency model.  As compared to sequential consistency, relaxed consistency models provide faster execution. Now if we can prefetch instructions while conforming to the consistency model, then that will result in an even higher speed up.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching under consistency models'''&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
Prefetching can also be classified as binding or non-binding. In binding prefetching, the value of the prefetched data is locked the instant it is prefetched and is released only when the process reads it later via a regular read. Other processes cannot modify the data during this period.  In non-binding prefetching, other processors can modify the data until the actual regular read is done. An advantage of non-binding prefetching is that it does not affect the correctness of any consistency model.&lt;br /&gt;
Let’s see the improvement in execution time from prefetching by considering the instruction sequences given in examples 4.1 and 4.2. In both, the cache hit latency is 1 cycle and the cache miss latency is 100 cycles. An invalidation-based cache coherence scheme is used in both examples.&lt;br /&gt;
&lt;br /&gt;
Example 4.1:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  write A	(miss)&lt;br /&gt;
  write B	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
&lt;br /&gt;
Example 4.2:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  read C		(miss)&lt;br /&gt;
  read D		(hit)&lt;br /&gt;
  read E[D]	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
In sequential consistency (SC), reordering of operations is not allowed. Hence for example 4.1, 3 cache misses and a cache hit take a total of 301 cycles to execute. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock. Hence under RC, the total time for execution is 202 cycles as the second write is a cache hit.  Prefetching improves the performance of systems following either SC or RC. Using prefetching, we assume that lock acquisition is bound to succeed and hence prefetch both the writes. When lock acquisition actually succeeds, we already have data for both the writes prefetched in our cache. Hence all writes incur a cache hit and the total number of cycles required to execute the set will be reduced to 103 cycles for both consistency models.&lt;br /&gt;
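&lt;br /&gt;
The cycle counts quoted for Example 4.1 can be reproduced with a small calculation. The Python below is a deliberately simplified model, assuming that each access costs either 1 or 100 cycles, that pipelined writes hide all but the first miss, and that prefetching turns every access after the lock into a hit. The function names are hypothetical and the model is not a general simulator.&lt;br /&gt;
&lt;br /&gt;
```python
HIT, MISS = 1, 100   # latencies in cycles, as given in the text

# Example 4.1 as (operation, latency without prefetching)
example_4_1 = [("lock L", MISS), ("write A", MISS),
               ("write B", MISS), ("unlock L", HIT)]

def serial_cycles(ops):
    """SC without prefetching: every access waits for the previous one."""
    return sum(latency for _, latency in ops)

def pipelined_write_cycles(ops):
    """RC without prefetching: writes inside the lock overlap, so only the
    first write pays the miss latency and each later write costs one cycle."""
    total, seen_write = 0, False
    for name, latency in ops:
        if name.startswith("write"):
            total += HIT if seen_write else latency
            seen_write = True
        else:
            total += latency
    return total

def prefetched_cycles(ops):
    """With prefetching: everything after the lock hits in the cache."""
    return ops[0][1] + HIT * (len(ops) - 1)

print(serial_cycles(example_4_1),
      pipelined_write_cycles(example_4_1),
      prefetched_cycles(example_4_1))
```
&lt;br /&gt;
The three values correspond to SC without prefetching, RC without prefetching, and either model with prefetching.&lt;br /&gt;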
&lt;br /&gt;
A case where prefetching helps much less is given in example 4.2. The read instructions have a data dependency between the second and third reads, because the address of E depends on the value of D. Under SC, the instructions take 302 cycles to perform. Under RC, they take 203 cycles as reading the value of E is a cache hit using pipelining within the critical section. With prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. The smaller improvement is attributed to the data dependency between D and E (under SC) and between lock acquisition and D (under RC).  Hence prefetching improves execution time far less in such cases.&lt;br /&gt;
&lt;br /&gt;
=='''Disadvantages of Prefetching'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
* Increased Complexity and overhead of handling the prefetching algorithms-  Performance must be improved significantly to counterbalance this overhead and complexity or the efforts are not worthwhile.&lt;br /&gt;
&lt;br /&gt;
* With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory to not only deal with regular prefetch requests but also to handle prefetch from different sources, thus greatly increasing the overhead and complexity of logic. &lt;br /&gt;
&lt;br /&gt;
* If prefetched data is stored in the data cache, then cache conflict or cache pollution, can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=='''Conclusion'''==&lt;br /&gt;
The article dealt with the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and how prefetching can affect the execution time across different consistency models. Overall, prefetching can genuinely help in lowering execution time if out-of-order execution does not take place. However, it cannot help in overlapping memory accesses; the processor will get the loads in the order determined in the program. The only difference is that it will get the data from the cache rather than the main memory because it has been prefetched.&lt;br /&gt;
&lt;br /&gt;
=='''Quiz'''==&lt;br /&gt;
1.	In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?&lt;br /&gt;
&lt;br /&gt;
#	Yes in both models&lt;br /&gt;
#	Yes in SC, No in RC&lt;br /&gt;
#	No in SC, Yes in RC&lt;br /&gt;
#	No in both SC and RC.&lt;br /&gt;
&lt;br /&gt;
2.	___________ model categorizes the synchronization operation into Acquire and Release.&lt;br /&gt;
&lt;br /&gt;
#	Sequential Consistency&lt;br /&gt;
#	Release Consistency&lt;br /&gt;
#	Weak Ordering &lt;br /&gt;
#	Processor Consistency.&lt;br /&gt;
&lt;br /&gt;
3.	Which of these is &amp;lt;i&amp;gt;not &amp;lt;/i&amp;gt;a type of hardware implementation of a prefetcher?&lt;br /&gt;
&lt;br /&gt;
#	Predication&lt;br /&gt;
#	Stride&lt;br /&gt;
#	Stream&lt;br /&gt;
#	Adjacent Cache line prefetcher&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:NewCacheSize.jpg&amp;diff=74650</id>
		<title>File:NewCacheSize.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:NewCacheSize.jpg&amp;diff=74650"/>
		<updated>2013-04-03T18:49:14Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74649</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74649"/>
		<updated>2013-04-03T18:46:39Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Prefetching Improvements */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]&lt;br /&gt;
&lt;br /&gt;
'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
Previous articles&lt;br /&gt;
&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by prefetching techniques, such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line may already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
1. A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
3. An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
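&lt;br /&gt;
As a rough illustration of the stride variant, the sketch below keeps a small table indexed by the address of the load instruction (its program counter) and predicts the next address once the same stride has been seen twice in a row. The class name, the table layout, and the addresses are assumptions made for this example. Real hardware uses finite tables and more elaborate confidence logic.&lt;br /&gt;

```python
class StridePrefetcher:
    def __init__(self):
        # Maps a load instruction address (pc) to (last_address, last_stride).
        self.table = {}

    def access(self, pc, address):
        # Returns the address to prefetch, or None if no stable stride yet.
        prediction = None
        if pc in self.table:
            last_address, last_stride = self.table[pc]
            stride = address - last_address
            if stride != 0 and stride == last_stride:
                prediction = address + stride
            self.table[pc] = (address, stride)
        else:
            self.table[pc] = (address, 0)
        return prediction

prefetcher = StridePrefetcher()
# One load instruction (hypothetical pc 0x40) walking memory in 64-byte steps.
results = [prefetcher.access(0x40, a) for a in (1000, 1064, 1128)]
print(results)  # prints [None, None, 1192]
```

The first two accesses only train the table. The third confirms the stride of 64 and yields a prediction.&lt;br /&gt;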
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
* Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
* Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Works with existing applications&lt;br /&gt;
&lt;br /&gt;
* Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
* Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Hardware-based techniques use previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.  &lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses. &lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
1).''always prefetch'' - prefetch the next block on each reference &lt;br /&gt;
&lt;br /&gt;
2).''prefetch on miss'' - prefetch the next block only on a miss &lt;br /&gt;
&lt;br /&gt;
3).''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
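&lt;br /&gt;
The three OBL policies can be compared with a small simulation. This is a minimal sketch under simplifying assumptions: the cache is unbounded, prefetched blocks arrive instantly, and the function name and policy labels are made up for this example.&lt;br /&gt;

```python
def simulate_obl(policy, accesses):
    cache = set()    # blocks currently cached (unbounded, for simplicity)
    tagged = set()   # blocks brought in by a prefetch and not yet referenced
    misses = 0
    for block in accesses:
        hit = block in cache
        first_use_of_prefetched = block in tagged
        if not hit:
            misses += 1
            cache.add(block)
        tagged.discard(block)
        if policy == "always":
            do_prefetch = True
        elif policy == "on_miss":
            do_prefetch = not hit
        else:  # "tagged": prefetch on the first reference to a block
            do_prefetch = (not hit) or first_use_of_prefetched
        if do_prefetch and (block + 1) not in cache:
            cache.add(block + 1)
            tagged.add(block + 1)
    return misses

stream = list(range(8))  # a purely sequential access stream
for policy in ("always", "on_miss", "tagged"):
    print(policy, simulate_obl(policy, stream))
# prints: always 1, on_miss 4, tagged 1
```

On this sequential stream, prefetch on miss removes only every other miss, while tagged prefetch matches always prefetch without issuing a prefetch on every repeated reference.&lt;br /&gt;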
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns. &lt;br /&gt;
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching Improvements'''==&lt;br /&gt;
&lt;br /&gt;
Fortunately a number of techniques have evolved to address many of the concerns about prefetching, thus making it significantly more effective.&lt;br /&gt;
&lt;br /&gt;
==='''Cache Sizing'''===&lt;br /&gt;
&lt;br /&gt;
Increasing the dimensions of the cache greatly improves the performance of prefetching.  This can be done either through increasing the cache size itself or increasing set associativity.  A small cache size is a significant problem with prefetching because conflict misses increase dramatically.  Any benefit you might derive from having fewer cache lookup misses is negated by this higher incidence of conflict misses.  The following study assembled some benchmarks for a given SPARC (See [[#Definitions]]) processor that utilized prefetching.  The following graph shows various levels of cache misses based on various cache sizes.  With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, whereas the number of misses without prefetching is rather high.&lt;br /&gt;
&lt;br /&gt;
[[Image:newCacheSize.jpg |thumb|right|300px| Cache Size]]&lt;br /&gt;
&lt;br /&gt;
For example, the following study of a BioInformatics application showed the effects of prefetching cache misses based on the size of the L1 cache.  As can be seen, cache size has a major impact on the miss rate and, subsequently, performance in the system.&amp;lt;ref&amp;gt;http://www.nikolaylaptev.com/master/classes/cs254.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Cachesize.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==='''Improved Prefetch Timing'''===&lt;br /&gt;
&lt;br /&gt;
Improving prefetch timing algorithms can also significantly improve performance and prevent the issue of data being prefetched too late or too early to be useful.  One possibility in this case is to implement a technique called Aggressive Prefetching&amp;lt;ref&amp;gt;http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html&amp;lt;/ref&amp;gt;.  Traditionally, prefetch algorithms have fetched only incremental amounts of data with higher levels of confidence.  The more aggressive approach, however, seeks to search more deeply in the reference stream to bring in a wider array of data that may be used in the near future.  This does, however, require a more resource-intensive machine to avoid negatively impacting performance.&lt;br /&gt;
&lt;br /&gt;
Greater amounts of processing power and storage have made this more aggressive approach possible.  While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now.  Varying performance from a variety of devices also changes the need to be conservative in prefetching.  Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.&lt;br /&gt;
&lt;br /&gt;
==='''Memory Pattern Recognition'''===&lt;br /&gt;
&lt;br /&gt;
Effective memory pattern recognition algorithms can go a long way towards resolving issues with prefetching values unnecessarily (i.e., ones that might not be used in a sufficient timeframe).  One example is the stride technique.  For example, memory addresses that are sequential in nature, like those of an array, would all be brought in together as a unit since more than likely most or all of these elements will be utilized.  This is an improvement over a more conservative prefetcher that may just assume memory located together will be used.  The more conservative version has been known to have a much higher incidence of bringing in unnecessary blocks of data.&lt;br /&gt;
&lt;br /&gt;
Another variation of this is the linked memory reference pattern.  Linked lists(See [[#Definitions]]) and tree structures are most often associated with this type.  Generally, occurrences of code similar to ptr = ptr-&amp;gt;next are considered as one structure and subsequently brought into memory together.&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, instead of:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
the linked memory reference pattern may work in the following way:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   prefetch(next(p))&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Here, the node that follows p is requested ahead of time, so it is already closer to the processor when the loop advances to it.  This can be more efficient than the initial loop.&lt;br /&gt;
&lt;br /&gt;
==='''Markov Prefetcher'''===&lt;br /&gt;
&lt;br /&gt;
This particular prefetch technique&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt; is built around the idea of remembering past sequences of cache misses and using this memory to bring in subsequent misses in the sequence.  Generally a rather large table would be used containing previous miss addresses.  This table is maintained in a similar manner to a cache.  Sequences stored in this table are then used to help predict future sequences that may need to be prefetched.&lt;br /&gt;
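&lt;br /&gt;
A minimal sketch of the idea follows. It records which miss address followed which, and predicts the most frequently observed successors. The class and method names are hypothetical, and the table here is unbounded, whereas a hardware table would be finite and managed like a cache as described above.&lt;br /&gt;

```python
from collections import Counter, defaultdict

class MarkovPrefetcher:
    def __init__(self):
        # For each miss address, counts of the miss addresses that followed it.
        self.table = defaultdict(Counter)
        self.last_miss = None

    def record_miss(self, address):
        if self.last_miss is not None:
            self.table[self.last_miss][address] += 1
        self.last_miss = address

    def predict(self, address, width=1):
        # Most frequently observed successors of this miss address.
        successors = self.table.get(address, Counter())
        return [addr for addr, _ in successors.most_common(width)]

prefetcher = MarkovPrefetcher()
for miss in (0x100, 0x200, 0x300, 0x100, 0x200, 0x400, 0x100, 0x200):
    prefetcher.record_miss(miss)
print(prefetcher.predict(0x100))  # prints [512], i.e. 0x200
```

Raising the width parameter corresponds to prefetching several predicted successors at once, at the cost of more bandwidth.&lt;br /&gt;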
&lt;br /&gt;
==='''Stealth Prefetching'''===&lt;br /&gt;
&lt;br /&gt;
A more modern and innovative technique, stealth prefetching&amp;lt;ref&amp;gt;http://arnetminer.org/publication/stealth-prefetching-53889.html&amp;lt;/ref&amp;gt;, attempts to deal with issues surrounding the increase of interconnection bandwidth and increased memory latency.  It also helps avoid the problem of using bandwidth to prematurely access shared data that will later result in state downgrades.  Basically, this technique tracks regions of memory at a coarse granularity to identify which segments are heavily used by multiple processors and which ones are not.  Lines that are not being shared by multiple processors are more aggressively brought nearer to the local processor, thus improving performance by as much as 20%.  With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.&lt;br /&gt;
&lt;br /&gt;
=='''Memory Consistency models'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
A basic requirement for the implementation of a shared-memory multiprocessor system is to correctly order the reads and writes to any memory location as seen by any processor accessing it. This requirement is called the memory consistency requirement. The memory consistency models adhere to this requirement. The models are implemented in a system at the machine code interface as well as the high-level language interface. They are applied while translating high-level code to the machine-level code that is directly executed. The consistency models therefore make memory operations predictable. &lt;br /&gt;
&lt;br /&gt;
==='''An overview of memory consistency models'''===&lt;br /&gt;
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:&lt;br /&gt;
&lt;br /&gt;
“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”&lt;br /&gt;
&lt;br /&gt;
[[Image:2.1.jpg|thumb|right|300px| Sequential Consistency Example]]&lt;br /&gt;
&lt;br /&gt;
This model restricts the reordering of any read or store instructions and thus restricts compiler optimization techniques.  Models that do allow operations to be reordered are called relaxed consistency models. Because reads and writes can be reordered, they deviate from the programmer’s expectation. Some of the relaxed consistency models include processor consistency, weak ordering, and release consistency.&lt;br /&gt;
&lt;br /&gt;
''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all earlier stores complete before a load takes place, i.e. a younger load can complete before an older store that precedes it has completed. &lt;br /&gt;
&lt;br /&gt;
''Weak ordering (WO)'' – Weak ordering groups read and write instructions into critical sections.  Reordering the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. These critical sections are separated by synchronization events, which can be implemented as locks, barriers, or post-wait pairs. Therefore the requirements for the weak ordering model to work are: &lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly. &lt;br /&gt;
* All read and write operations before a synchronization event have been completed. &lt;br /&gt;
* All loads and stores following a critical section cannot precede the section.&lt;br /&gt;
&lt;br /&gt;
''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, while a thread leaves the critical section after a release operation. Hence an acquire operation is not concerned with older read/write operations, as that requirement is taken care of by the release operation; similarly, a release operation is not concerned with younger reads/writes, as they are handled by the acquire operation. Hence we can rewrite the requirements for weak ordering to follow the release consistency model as follows:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.&lt;br /&gt;
* All read and writes before a release operation should be completed&lt;br /&gt;
* All acquire operations related to a critical section should be completed before handling a younger read write. &lt;br /&gt;
* The acquire and release operations should be atomic with respect to each other. &lt;br /&gt;
A core difference between weak ordering and release consistency is that in release consistency overlapping of critical sections is possible.&lt;br /&gt;
Figure 3.1 shows how a sequence of instructions will be executed following a particular consistency model.  As compared to sequential consistency, relaxed consistency models provide faster execution. Now if we can prefetch instructions while conforming to the consistency model, then that will result in an even higher speed up.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching under consistency models'''&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
Prefetching can also be classified as binding and non-binding. In binding prefetching, the value of the prefetched data is locked the instant it is prefetched and will be released only when the process reads it later via a regular read. Other processes cannot modify the data during this period.  In non-binding prefetching, other processors can modify the data until the actual regular read is done. An advantage of using non-binding prefetching is that it does not affect the correctness of any consistency model.&lt;br /&gt;
Let’s see the improvement in execution time using prefetching by considering a set of instructions  given in example 4.1 and 4.2. For both sets, the cache hit latency is 1 cycle while the cache miss latency is 100 cycles. An invalidation based cache coherence scheme is used in both the examples.  &lt;br /&gt;
&lt;br /&gt;
Example 4.1:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  write A	(miss)&lt;br /&gt;
  write B	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
&lt;br /&gt;
Example 4.2:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  read C		(miss)&lt;br /&gt;
  read D		(hit)&lt;br /&gt;
  read E[D]	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
In sequential consistency (SC), reordering of operations is not allowed. Hence for example 4.1, 3 cache misses and a cache hit take a total of 301 cycles to execute. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock. Hence under RC, the total time for execution is 202 cycles as the second write is a cache hit.  Prefetching improves the performance of systems following either SC or RC. Using prefetching, we assume that lock acquisition is bound to succeed and hence prefetch both the writes. When lock acquisition actually succeeds, we already have data for both the writes prefetched in our cache. Hence all writes incur a cache hit and the total number of cycles required to execute the set will be reduced to 103 cycles for both consistency models.&lt;br /&gt;
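&lt;br /&gt;
The cycle counts quoted for example 4.1 can be reproduced with a very simple cost model. This is only a sketch of the arithmetic, assuming the latencies given above (1 cycle for a hit, 100 for a miss) and that a write overlapped with an earlier outstanding miss adds a single cycle. The function names are made up for this example.&lt;br /&gt;

```python
HIT, MISS = 1, 100

def sequential_cycles(latencies):
    # No overlap: each access completes before the next one starts.
    return sum(latencies)

def pipelined_writes_cycles(lock, writes, unlock):
    # Writes inside the critical section overlap: pay the longest latency
    # once, plus one cycle for each additional write.
    return lock + max(writes) + (len(writes) - 1) * HIT + unlock

# Example 4.1: lock L (miss), write A (miss), write B (miss), unlock L (hit)
sc = sequential_cycles([MISS, MISS, MISS, HIT])
rc = pipelined_writes_cycles(MISS, [MISS, MISS], HIT)
# With prefetching, both writes hit once the lock has been acquired.
prefetched = sequential_cycles([MISS, HIT, HIT, HIT])
print(sc, rc, prefetched)  # prints 301 202 103
```

The model does not capture data dependencies between accesses, so it is not meant to be applied to example 4.2.&lt;br /&gt;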
&lt;br /&gt;
A case where prefetching does not improve execution time is given in example 4.2. The read instructions indicate a data dependency between the third and second read. Under SC, the instructions take 302 cycles to perform. Under RC, they take 203 cycles as reading the value of E is a cache hit using pipelining within the critical section. With the prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. This reduction in improvement of execution time is attributed to the data dependency between the D and E (under SC) and between lock acquisition and D (under RC).  Hence prefetching fails to improve the performance in execution time in such cases.&lt;br /&gt;
&lt;br /&gt;
=='''Disadvantages of Prefetching'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
* Increased Complexity and overhead of handling the prefetching algorithms-  Performance must be improved significantly to counterbalance this overhead and complexity or the efforts are not worthwhile.&lt;br /&gt;
&lt;br /&gt;
* With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory to not only deal with regular prefetch requests but also to handle prefetch from different sources, thus greatly increasing the overhead and complexity of logic. &lt;br /&gt;
&lt;br /&gt;
* If prefetched data is stored in the data cache, then cache conflicts, or cache pollution, can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=='''Conclusion'''==&lt;br /&gt;
The article dealt with the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and how prefetching can affect the execution time across different consistency models. Overall, prefetching can genuinely help in lowering execution time if out-of-order execution does not take place. However, it cannot help in overlapping memory accesses; the processor will get the loads in the order determined by the program. The only difference is that it will get the data from the cache rather than from main memory because it has been prefetched.&lt;br /&gt;
&lt;br /&gt;
=='''Quiz'''==&lt;br /&gt;
1.	In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?&lt;br /&gt;
&lt;br /&gt;
#	Yes in both models&lt;br /&gt;
#	Yes in SC, No in RC&lt;br /&gt;
#	No in SC, Yes in RC&lt;br /&gt;
#	No in both SC and RC.&lt;br /&gt;
&lt;br /&gt;
2.	___________ model categorizes the synchronization operation into Acquire and Release.&lt;br /&gt;
&lt;br /&gt;
#	Sequential Consistency&lt;br /&gt;
#	Release Consistency&lt;br /&gt;
#	Weak Ordering &lt;br /&gt;
#	Processor Consistency.&lt;br /&gt;
&lt;br /&gt;
3.	Which of these is &amp;lt;i&amp;gt;not &amp;lt;/i&amp;gt;a type of hardware implementation of a prefetcher?&lt;br /&gt;
&lt;br /&gt;
#	Predication&lt;br /&gt;
#	Stride&lt;br /&gt;
#	Stream&lt;br /&gt;
#	Adjacent Cache line prefetcher&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74648</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74648"/>
		<updated>2013-04-03T18:45:36Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Prefetching Improvements */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]&lt;br /&gt;
&lt;br /&gt;
'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
Previous articles&lt;br /&gt;
&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache was fully-associative and had LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency.The number of compulsory and capacity misses can be reduced by employing prefetching techniques such as by increasing the size of the cache lines or by prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line may already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but also introduces additional cost, since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
1. A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, which need not be to consecutive cache lines. When such an instruction is detected, the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
3. An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
* Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
* Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Works with existing applications&lt;br /&gt;
&lt;br /&gt;
* Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
* Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Hardware-based techniques use previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.  &lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses. &lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
1).''always prefetch'' - prefetch the next block on each reference &lt;br /&gt;
&lt;br /&gt;
2).''prefetch on miss'' - prefetch the next block only on a miss &lt;br /&gt;
&lt;br /&gt;
3).''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns. &lt;br /&gt;
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching Improvements'''==&lt;br /&gt;
&lt;br /&gt;
Fortunately a number of techniques have evolved to address many of the concerns about prefetching, thus making it significantly more effective.&lt;br /&gt;
&lt;br /&gt;
==='''Cache Sizing'''===&lt;br /&gt;
&lt;br /&gt;
Increasing the dimensions of the cache greatly improves the performance of prefetching.  This can be done either through increasing the cache size itself or increasing set associativity.  A small cache size is a significant problem with prefetching because conflict misses increase dramatically.  Any benefit you might derive from having fewer cache lookup misses is negated by this higher incidence of conflict misses.  The following study assembled some benchmarks for a given SPARC (See [[#Definitions]]) processor that utilized prefetching.  The following graph shows various levels of cache misses based on various cache sizes.  With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, whereas the number of misses without prefetching is rather high.&lt;br /&gt;
&lt;br /&gt;
For example, the following study of a BioInformatics application showed the effects of prefetching cache misses based on the size of the L1 cache.  As can be seen, cache size has a major impact on the miss rate and, subsequently, performance in the system.&amp;lt;ref&amp;gt;http://www.nikolaylaptev.com/master/classes/cs254.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Cachesize.jpg|thumb|right|300px|Cache Size]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==='''Improved Prefetch Timing'''===&lt;br /&gt;
&lt;br /&gt;
Improving prefetch timing can also significantly improve performance by preventing data from being prefetched too late or too early to be useful.  One possibility is a technique called aggressive prefetching&amp;lt;ref&amp;gt;http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html&amp;lt;/ref&amp;gt;.  Traditionally, prefetch algorithms have fetched only incremental amounts of data, and only when confidence that the data will be used is high.  The more aggressive approach instead searches more deeply in the reference stream to bring in a wider array of data that may be used in the near future.  This does, however, require a machine with more resources to avoid negatively impacting performance.&lt;br /&gt;
&lt;br /&gt;
Greater amounts of processing power and storage have made this more aggressive approach possible.  While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now.  Varying performance across a variety of devices also changes the need to be conservative in prefetching.  Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.&lt;br /&gt;
&lt;br /&gt;
==='''Memory Pattern Recognition'''===&lt;br /&gt;
&lt;br /&gt;
Effective memory pattern recognition algorithms can go a long way towards avoiding unnecessary prefetches (i.e., of values that will not be used within a useful timeframe).  One example is the stride technique.  Memory addresses that are accessed sequentially, as in an array traversal, are brought in together as a unit, since most or all of these elements are likely to be used.  This is an improvement over a more conservative prefetcher that simply assumes memory located together will be used together; that version has been known to bring in unnecessary blocks of data much more often.&lt;br /&gt;
&lt;br /&gt;
Another variation of this is the linked memory reference pattern.  Linked lists (see [[#Definitions]]) and tree structures are most often associated with this type.  Generally, occurrences of code similar to ptr = ptr-&amp;gt;next are treated as one structure, whose elements are subsequently brought into memory together.&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, instead of:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   work(p);&lt;br /&gt;
   p = next(p);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
the linked memory reference pattern may work in the following way:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   prefetch(next(p));  /* request the following node before working on p */&lt;br /&gt;
   work(p);&lt;br /&gt;
   p = next(p);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Here, the node after p is requested ahead of time, so it is closer to the processor when the loop advances to it.  This is more efficient than the initial loop, provided work(p) runs long enough to overlap with the fetch.&lt;br /&gt;
&lt;br /&gt;
==='''Markov Prefetcher'''===&lt;br /&gt;
&lt;br /&gt;
This prefetch technique&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt; is built around remembering past sequences of cache misses and using that history to bring in the subsequent misses of a sequence.  Generally, a rather large table holds previous miss addresses, and it is maintained in a manner similar to a cache.  Sequences stored in this table are then used to predict future miss sequences, which become candidates for prefetching.&lt;br /&gt;
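&lt;br /&gt;
The table-driven idea can be illustrated with a minimal sketch. It assumes a first-order model (only the immediately preceding miss is used as context) and an unbounded table, whereas a real prefetcher would use a finite table managed like a cache. The class name MarkovPrefetcher, the method on_miss, and the width parameter are hypothetical names chosen for this illustration.&lt;br /&gt;

```python
from collections import Counter, defaultdict


class MarkovPrefetcher:
    """Toy first-order Markov predictor over a stream of miss addresses."""

    def __init__(self, width=2):
        self.width = width                      # maximum predictions per miss
        self.successors = defaultdict(Counter)  # miss address to counts of the misses that followed it
        self.last_miss = None

    def on_miss(self, address):
        """Record the miss, then return the addresses most likely to miss next."""
        if self.last_miss is not None:
            self.successors[self.last_miss][address] += 1
        self.last_miss = address
        ranked = self.successors[address].most_common(self.width)
        return [candidate for candidate, _ in ranked]


# Train on a short, made-up miss stream, then ask what usually follows 0x10.
prefetcher = MarkovPrefetcher(width=2)
for miss in [0x10, 0x40, 0x10, 0x40, 0x10, 0x80]:
    prefetcher.on_miss(miss)
print(prefetcher.on_miss(0x10))  # [64, 128]
```

After training, a miss on 0x10 has been followed twice by 0x40 and once by 0x80, so both are returned as prefetch candidates, most frequent first.&lt;br /&gt;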
&lt;br /&gt;
==='''Stealth Prefetching'''===&lt;br /&gt;
&lt;br /&gt;
A more recent technique, stealth prefetching&amp;lt;ref&amp;gt;http://arnetminer.org/publication/stealth-prefetching-53889.html&amp;lt;/ref&amp;gt;, attempts to deal with issues surrounding increased interconnect traffic and increased memory latency.  It also helps avoid the problem of spending bandwidth to prematurely access shared data that will later result in state downgrades.  Basically, this technique tracks memory at a coarse, region-level granularity to identify which regions are heavily used by multiple processors and which are not.  Lines that are not shared by multiple processors are more aggressively brought nearer to the local processor, improving performance by as much as 20%.  With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.&lt;br /&gt;
&lt;br /&gt;
=='''Memory Consistency models'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
A basic requirement for implementing a shared-memory multiprocessor system is that the reads and writes to any memory location be correctly ordered as seen by every processor that accesses it. This is called the memory consistency requirement, and memory consistency models specify how it is met. The models apply at the machine-code interface as well as at the high-level language interface, so they must be respected when high-level code is translated into the machine code that is directly executed. Consistency models therefore make the behavior of memory operations predictable.&lt;br /&gt;
&lt;br /&gt;
==='''An overview of memory consistency models'''===&lt;br /&gt;
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:&lt;br /&gt;
&lt;br /&gt;
“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”&lt;br /&gt;
&lt;br /&gt;
[[Image:2.1.jpg|thumb|right|300px| Sequential Consistency Example]]&lt;br /&gt;
&lt;br /&gt;
This model restricts the reordering of any load or store instructions and thus restricts compiler optimization techniques.  Memory consistency models that do allow operations to be reordered are called relaxed consistency models. Because reads and writes can be reordered, these models deviate from the programmer’s expectation. Relaxed consistency models include processor consistency, weak ordering, and release consistency.&lt;br /&gt;
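&lt;br /&gt;
To make Lamport's definition concrete, the sketch below enumerates every interleaving of a small two-processor program that keeps each processor's own program order, and collects the results that sequential consistency permits. The program itself (P1 stores x then loads y, P2 stores y then loads x, both variables initially 0) and the helper names are assumptions introduced for illustration; they are not taken from the figure.&lt;br /&gt;

```python
from itertools import combinations

# Hypothetical two-processor test program; x and y start at 0.
P1 = [("store", "x", 1), ("load", "y", "r1")]
P2 = [("store", "y", 1), ("load", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves the order within each list."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        slots = set(slots)  # positions in the merged sequence taken by a
        next_a, next_b = iter(a), iter(b)
        yield [next(next_a) if i in slots else next(next_b) for i in range(total)]


def run(schedule):
    """Execute one interleaving, one atomic operation at a time."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for kind, variable, operand in schedule:
        if kind == "store":
            memory[variable] = operand
        else:
            registers[operand] = memory[variable]
    return registers["r1"], registers["r2"]


sc_outcomes = {run(schedule) for schedule in interleavings(P1, P2)}
print(sorted(sc_outcomes))  # [(0, 1), (1, 0), (1, 1)]
```

The result r1 = 0 and r2 = 0 never appears, because it would require each load to run before the other processor's store. A relaxed model that lets a load complete before an older store could produce it, which is the kind of deviation from the programmer's expectation described above.&lt;br /&gt;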
&lt;br /&gt;
''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all earlier stores complete before a load takes place; that is, a younger load can complete before an older store that precedes it has completed.&lt;br /&gt;
&lt;br /&gt;
''Weak ordering (WO)'' – Weak ordering separates all read and write instructions into critical sections.  Reordering the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. The critical sections are separated by synchronization events, which can be implemented as locks, barriers, or post-wait pairs. The requirements for the weak ordering model to work are therefore:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly. &lt;br /&gt;
* All read and write operations before a synchronization event have been completed. &lt;br /&gt;
* All loads and stores following a critical section cannot precede the section.&lt;br /&gt;
&lt;br /&gt;
''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, and a thread leaves the critical section with a release operation. Hence an acquire operation is not concerned with older reads and writes, as that requirement is taken care of by the release operation; similarly, a release operation is not concerned with younger reads and writes, as those are handled by the acquire operation. We can therefore rewrite the requirements for weak ordering to follow the release consistency model as follows:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.&lt;br /&gt;
* All reads and writes before a release operation should be completed before the release is performed.&lt;br /&gt;
* All acquire operations related to a critical section should be completed before any younger read or write is handled.&lt;br /&gt;
* The acquire and release operations should be atomic with respect to each other.&lt;br /&gt;
A core difference between weak ordering and release consistency is that release consistency allows critical sections to overlap.&lt;br /&gt;
Figure 3.1 shows how a sequence of instructions is executed under a particular consistency model.  Compared to sequential consistency, relaxed consistency models provide faster execution. If we can also prefetch while conforming to the consistency model, the result is an even higher speedup.&lt;br /&gt;
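&lt;br /&gt;
The first requirement, that the programmer marks critical sections explicitly, can be illustrated with a short software analogy. The sketch uses Python's threading.Lock, whose acquire and release calls play the role of the acquire and release synchronization operations; it only illustrates how a program is structured and does not model hardware reordering. The names counter and worker are made up for this example.&lt;br /&gt;

```python
import threading

counter = 0
lock = threading.Lock()


def worker(iterations):
    """Increment the shared counter inside an explicitly marked critical section."""
    global counter
    for _ in range(iterations):
        lock.acquire()   # acquire: enter the critical section
        counter += 1     # ordinary read and write, protected by the lock
        lock.release()   # release: leave the critical section


threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(counter)  # 40000
```

Because every update sits between an acquire and a release, no increment is lost, regardless of how the threads are scheduled.&lt;br /&gt;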
&lt;br /&gt;
=='''Prefetching under consistency models'''&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
Prefetching can also be classified as binding or non-binding. In binding prefetching, the value of the prefetched data is locked the instant it is prefetched and is released only when the process later reads it via a regular read; other processes cannot modify the data during this period.  In non-binding prefetching, other processors can modify the data until the actual regular read is done. An advantage of non-binding prefetching is that it does not affect the correctness of any consistency model.&lt;br /&gt;
Let’s see how prefetching improves execution time by considering the instruction sequences in examples 4.1 and 4.2. In both, a cache hit takes 1 cycle and a cache miss takes 100 cycles, and an invalidation-based cache coherence scheme is used.&lt;br /&gt;
&lt;br /&gt;
Example 4.1:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  write A	(miss)&lt;br /&gt;
  write B	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
&lt;br /&gt;
Example 4.2:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  read C		(miss)&lt;br /&gt;
  read D		(hit)&lt;br /&gt;
  read E[D]	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
In sequential consistency (SC), reordering of operations is not allowed. Hence for example 4.1, 3 cache misses and a cache hit take a total of 301 cycles to execute. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock. Hence under RC, the total time for execution is 202 cycles as the second write is a cache hit.  Prefetching improves the performance of systems following either SC or RC. Using prefetching, we assume that lock acquisition is bound to succeed and hence prefetch both the writes. When lock acquisition actually succeeds, we already have data for both the writes prefetched in our cache. Hence all writes incur a cache hit and the total number of cycles required to execute the set will be reduced to 103 cycles for both consistency models.&lt;br /&gt;
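&lt;br /&gt;
The cycle counts for example 4.1 can be reproduced with a deliberately simple cost model. It assumes that, under RC, the first miss after the lock pays the full latency while later accesses in the critical section overlap with it and cost one cycle each, and that, with prefetching, only the lock itself misses. The function names are hypothetical, and the model ignores data dependencies, so it is not meant for example 4.2.&lt;br /&gt;

```python
HIT, MISS = 1, 100  # cycle costs stated for the examples

# Example 4.1 as (operation, latency without prefetching).
EXAMPLE_4_1 = [("lock L", MISS), ("write A", MISS), ("write B", MISS), ("unlock L", HIT)]


def sc_cycles(ops):
    """SC: every access completes before the next one is issued."""
    return sum(latency for _, latency in ops)


def rc_cycles(ops):
    """RC: the lock completes first; misses inside the section are pipelined."""
    total = ops[0][1]
    paid_miss = False
    for _, latency in ops[1:]:
        if latency == MISS and not paid_miss:
            total += MISS  # the first miss in the section pays in full
            paid_miss = True
        else:
            total += HIT   # overlapped misses and hits cost one cycle each
    return total


def prefetch_cycles(ops):
    """Prefetching: only the lock misses; every later access hits in the cache."""
    return ops[0][1] + HIT * (len(ops) - 1)


print(sc_cycles(EXAMPLE_4_1), rc_cycles(EXAMPLE_4_1), prefetch_cycles(EXAMPLE_4_1))  # 301 202 103
```

The three numbers match the totals quoted above: 301 cycles under SC, 202 under RC, and 103 with prefetching under either model.&lt;br /&gt;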
&lt;br /&gt;
A case where prefetching helps much less is given in example 4.2. The read instructions contain a data dependency: the third read, E[D], needs the value returned by the second read, D. Under SC, the instructions take 302 cycles to perform. Under RC, they take 203 cycles, as reading the value of E is a cache hit using pipelining within the critical section. With prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. The limited improvement is attributed to the data dependency between D and E (under SC) and between lock acquisition and D (under RC).  Hence prefetching yields little or no improvement in execution time in such cases.&lt;br /&gt;
&lt;br /&gt;
=='''Disadvantages of Prefetching'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
* Increased complexity and overhead of handling the prefetching algorithms: performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.&lt;br /&gt;
&lt;br /&gt;
* With multiple cores, prefetch requests can originate from a variety of different cores. The memory system must then handle prefetches from several sources in addition to regular requests, which greatly increases the overhead and complexity of the logic.&lt;br /&gt;
&lt;br /&gt;
* If prefetched data is stored in the data cache, then cache conflicts, or cache pollution, can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=='''Conclusion'''==&lt;br /&gt;
The article dealt with the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and showed how prefetching affects execution time under different consistency models. Overall, prefetching can genuinely help lower execution time when out-of-order execution does not take place. However, it cannot by itself overlap memory accesses; the processor still receives the loads in the order determined by the program. The only difference is that the data comes from the cache rather than from main memory because it has been prefetched.&lt;br /&gt;
&lt;br /&gt;
=='''Quiz'''==&lt;br /&gt;
1.	In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?&lt;br /&gt;
&lt;br /&gt;
#	Yes in both models&lt;br /&gt;
#	Yes in SC, No in RC&lt;br /&gt;
#	No in SC, Yes in RC&lt;br /&gt;
#	No in both SC and RC.&lt;br /&gt;
&lt;br /&gt;
2.	___________ model categorizes the synchronization operation into Acquire and Release.&lt;br /&gt;
&lt;br /&gt;
#	Sequential Consistency&lt;br /&gt;
#	Release Consistency&lt;br /&gt;
#	Weak Ordering &lt;br /&gt;
#	Processor Consistency.&lt;br /&gt;
&lt;br /&gt;
3.	Which of these is &amp;lt;i&amp;gt;not &amp;lt;/i&amp;gt;a type of hardware implementation of a prefetcher?&lt;br /&gt;
&lt;br /&gt;
#	Predication&lt;br /&gt;
#	Stride&lt;br /&gt;
#	Stream&lt;br /&gt;
#	Adjacent Cache line prefetcher&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74647</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74647"/>
		<updated>2013-04-03T18:42:37Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Prefetching Improvements */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]&lt;br /&gt;
&lt;br /&gt;
'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
Previous articles&lt;br /&gt;
&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative and had LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques, such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
1. A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
3. An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
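&lt;br /&gt;
As an illustration of the stride prefetcher in the list above, the following sketch keeps one table entry per load instruction and predicts the next address once the same non-zero stride has been seen twice in a row. The class name, the two-in-a-row rule, and the use of a plain dictionary keyed by the instruction address (pc) are simplifying assumptions, not a description of any particular processor.&lt;br /&gt;

```python
class StridePrefetcher:
    """Toy stride detector keyed by the address of the load instruction."""

    def __init__(self):
        self.table = {}  # pc mapped to (last_address, last_stride)

    def access(self, pc, address):
        """Return an address to prefetch, or None if no stable stride is seen yet."""
        prediction = None
        if pc in self.table:
            last_address, last_stride = self.table[pc]
            stride = address - last_address
            if stride != 0 and stride == last_stride:
                prediction = address + stride  # same stride twice: predict the next access
            self.table[pc] = (address, stride)
        else:
            self.table[pc] = (address, 0)      # first time this instruction is seen
        return prediction


# One load instruction walking an array in steps of 256 bytes (made-up addresses).
prefetcher = StridePrefetcher()
print([prefetcher.access(0x400, a) for a in (1000, 1256, 1512, 1768)])  # [None, None, 1768, 2024]
```

The first two accesses only establish the stride; from the third access onward the prefetcher stays one step ahead of the program, even though the accesses are not to consecutive cache lines.&lt;br /&gt;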
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
* Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
* Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The characteristics of the hardware prefetching are as follows :&lt;br /&gt;
&lt;br /&gt;
* Works with existing applications&lt;br /&gt;
&lt;br /&gt;
* Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
* Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Based on the previous memory access, the hardware based techniques can be employed to dynamically predict the memory address to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip Schemes: Based on the addresses required by the processor in all data references.  &lt;br /&gt;
* Off-chip Schemes: Based on the addresses that result in L1 cache misses&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
1).''always prefetch'' - prefetch the next block on each reference &lt;br /&gt;
&lt;br /&gt;
2).''prefetch on miss'' - prefetch the next block only on a miss &lt;br /&gt;
&lt;br /&gt;
3).''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
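&lt;br /&gt;
The difference between the three OBL variants can be seen in a small simulation. The sketch assumes an unbounded cache and prefetches that arrive instantly, and it counts only demand misses; the function name simulate_obl and the policy labels are hypothetical.&lt;br /&gt;

```python
def simulate_obl(references, policy):
    """Count demand misses for a stream of block numbers under one OBL policy."""
    cache = set()
    tagged = set()  # prefetched blocks that have not been referenced yet
    misses = 0
    for block in references:
        hit = block in cache
        first_use = (not hit) or (block in tagged)
        if not hit:
            misses += 1
            cache.add(block)
        tagged.discard(block)
        if policy == "always":
            fetch = True          # prefetch on every reference
        elif policy == "on_miss":
            fetch = not hit       # prefetch only when the reference missed
        elif policy == "tagged":
            fetch = first_use     # prefetch on the first reference to a block
        else:
            raise ValueError("unknown policy: " + policy)
        if fetch and (block + 1) not in cache:
            cache.add(block + 1)
            tagged.add(block + 1)
    return misses


# Sequential scan of eight blocks.
for name in ("always", "on_miss", "tagged"):
    print(name, simulate_obl(range(8), name))
```

On this sequential scan, prefetch on miss removes every other miss (4 misses instead of 8), while always prefetch and tagged prefetch leave only the first miss. Tagged prefetch achieves this without issuing a prefetch on every repeated reference to the same block.&lt;br /&gt;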
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.&lt;br /&gt;
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching Improvements'''==&lt;br /&gt;
&lt;br /&gt;
Fortunately a number of techniques have evolved to address many of the concerns about prefetching, thus making it significantly more effective.&lt;br /&gt;
&lt;br /&gt;
==='''Cache Sizing'''===&lt;br /&gt;
&lt;br /&gt;
Increasing the dimensions of the cache greatly improves the performance of prefetching.  This can be done either by increasing the cache size itself or by increasing set associativity.  A small cache is a significant problem for prefetching because conflict misses increase dramatically, and any benefit derived from fewer lookup misses is negated by this higher incidence of conflict misses.  One study assembled benchmarks for a SPARC (see [[#Definitions]]) processor that used prefetching and measured cache misses at various cache sizes.  With 16K and 32K caches, prefetching reduces cache misses to nearly zero, while the number of misses without prefetching remains rather high.&lt;br /&gt;
&lt;br /&gt;
For example, a study of a bioinformatics application measured the effect of prefetching on cache misses as a function of L1 cache size.  As the graph below shows, cache size has a major impact on the miss rate and, consequently, on system performance.&amp;lt;ref&amp;gt;http://www.nikolaylaptev.com/master/classes/cs254.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Cachesize.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==='''Improved Prefetch Timing'''===&lt;br /&gt;
&lt;br /&gt;
Improving prefetch timing can also significantly improve performance by preventing data from being prefetched too late or too early to be useful.  One possibility is a technique called aggressive prefetching&amp;lt;ref&amp;gt;http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html&amp;lt;/ref&amp;gt;.  Traditionally, prefetch algorithms have fetched only incremental amounts of data, and only when confidence that the data will be used is high.  The more aggressive approach instead searches more deeply in the reference stream to bring in a wider array of data that may be used in the near future.  This does, however, require a machine with more resources to avoid negatively impacting performance.&lt;br /&gt;
&lt;br /&gt;
Greater amounts of processing power and storage have made this more aggressive approach possible.  While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now.  Varying performance across a variety of devices also changes the need to be conservative in prefetching.  Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.&lt;br /&gt;
&lt;br /&gt;
==='''Memory Pattern Recognition'''===&lt;br /&gt;
&lt;br /&gt;
Effective memory pattern recognition algorithms can go a long way towards avoiding unnecessary prefetches (i.e., of values that will not be used within a useful timeframe).  One example is the stride technique.  Memory addresses that are accessed sequentially, as in an array traversal, are brought in together as a unit, since most or all of these elements are likely to be used.  This is an improvement over a more conservative prefetcher that simply assumes memory located together will be used together; that version has been known to bring in unnecessary blocks of data much more often.&lt;br /&gt;
&lt;br /&gt;
Another variation of this is the linked memory reference pattern.  Linked lists (see [[#Definitions]]) and tree structures are most often associated with this type.  Generally, occurrences of code similar to ptr = ptr-&amp;gt;next are treated as one structure, whose elements are subsequently brought into memory together.&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, instead of:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   work(p);&lt;br /&gt;
   p = next(p);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
the linked memory reference pattern may work in the following way:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   prefetch(next(p));  /* request the following node before working on p */&lt;br /&gt;
   work(p);&lt;br /&gt;
   p = next(p);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Here, the node after p is requested ahead of time, so it is closer to the processor when the loop advances to it.  This is more efficient than the initial loop, provided work(p) runs long enough to overlap with the fetch.&lt;br /&gt;
&lt;br /&gt;
==='''Markov Prefetcher'''===&lt;br /&gt;
&lt;br /&gt;
This prefetch technique&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt; is built around remembering past sequences of cache misses and using that history to bring in the subsequent misses of a sequence.  Generally, a rather large table holds previous miss addresses, and it is maintained in a manner similar to a cache.  Sequences stored in this table are then used to predict future miss sequences, which become candidates for prefetching.&lt;br /&gt;
&lt;br /&gt;
==='''Stealth Prefetching'''===&lt;br /&gt;
&lt;br /&gt;
A more recent technique, stealth prefetching&amp;lt;ref&amp;gt;http://arnetminer.org/publication/stealth-prefetching-53889.html&amp;lt;/ref&amp;gt;, attempts to deal with issues surrounding increased interconnect traffic and increased memory latency.  It also helps avoid the problem of spending bandwidth to prematurely access shared data that will later result in state downgrades.  Basically, this technique tracks memory at a coarse, region-level granularity to identify which regions are heavily used by multiple processors and which are not.  Lines that are not shared by multiple processors are more aggressively brought nearer to the local processor, improving performance by as much as 20%.  With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.&lt;br /&gt;
&lt;br /&gt;
=='''Memory Consistency models'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
A basic requirement for implementing a shared-memory multiprocessor system is that the reads and writes to any memory location be correctly ordered as seen by every processor that accesses it. This is called the memory consistency requirement, and memory consistency models specify how it is met. The models apply at the machine-code interface as well as at the high-level language interface, so they must be respected when high-level code is translated into the machine code that is directly executed. Consistency models therefore make the behavior of memory operations predictable.&lt;br /&gt;
&lt;br /&gt;
==='''An overview of memory consistency models'''===&lt;br /&gt;
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:&lt;br /&gt;
&lt;br /&gt;
“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”&lt;br /&gt;
&lt;br /&gt;
[[Image:2.1.jpg|thumb|right|300px| Sequential Consistency Example]]&lt;br /&gt;
&lt;br /&gt;
This model restricts the reordering of any load or store instructions and thus restricts compiler optimization techniques.  Memory consistency models that do allow operations to be reordered are called relaxed consistency models. Because reads and writes can be reordered, these models deviate from the programmer’s expectation. Relaxed consistency models include processor consistency, weak ordering, and release consistency.&lt;br /&gt;
&lt;br /&gt;
''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all earlier stores complete before a load takes place, i.e. a younger load can complete before an older store that precedes it has completed. &lt;br /&gt;
&lt;br /&gt;
''Weak ordering (WO)'' - Weak ordering groups read and write instructions into critical sections.  Reordering the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. The critical sections are separated by synchronization events, which can be implemented in the form of locks, barriers, or post-wait pairs. The requirements for the weak ordering model to work are therefore: &lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly. &lt;br /&gt;
* All read and write operations before a synchronization event have been completed. &lt;br /&gt;
* All loads and stores following a critical section cannot precede the section.&lt;br /&gt;
&lt;br /&gt;
''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, while a thread leaves the critical section after a release operation. Hence an acquire operation is not concerned with older read/write operations, as that requirement is taken care of by the release operation; similarly, a release operation is not concerned with younger reads/writes, as those are handled by the acquire operation. We can therefore rewrite the requirements for weak ordering to follow the release consistency model as follows:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.&lt;br /&gt;
* All reads and writes before a release operation should be completed.&lt;br /&gt;
* All acquire operations related to a critical section should be completed before handling a younger read or write. &lt;br /&gt;
* The acquire and release operations should be atomic with respect to each other. &lt;br /&gt;
A core difference between weak ordering and release consistency is that in release consistency overlapping of critical sections is possible.&lt;br /&gt;
Figure 3.1 shows how a sequence of instructions will be executed following a particular consistency model.  As compared to sequential consistency, relaxed consistency models provide faster execution. Now if we can prefetch instructions while conforming to the consistency model, then that will result in an even higher speed up.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching under consistency models'''&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
Prefetching can also be classified as binding and non-binding. In binding prefetching, the value of the prefetched data is locked the instant it is prefetched and will be released only when the process reads it later via a regular read. Other processes cannot modify the data during this period.  In non-binding prefetching, other processors can modify the data until the actual regular read is done. An advantage of using non-binding prefetching is that it does not affect the correctness of any consistency model.&lt;br /&gt;
Let’s see the improvement in execution time using prefetching by considering a set of instructions  given in example 4.1 and 4.2. For both sets, the cache hit latency is 1 cycle while the cache miss latency is 100 cycles. An invalidation based cache coherence scheme is used in both the examples.  &lt;br /&gt;
&lt;br /&gt;
Example 4.1:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  write A	(miss)&lt;br /&gt;
  write B	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
&lt;br /&gt;
Example 4.2:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  read C		(miss)&lt;br /&gt;
  read D		(hit)&lt;br /&gt;
  read E[D]	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
In sequential consistency (SC), reordering of operations is not allowed. Hence for example 4.1, 3 cache misses and a cache hit take a total of 301 cycles to execute. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock. Hence under RC, the total time for execution is 202 cycles as the second write is a cache hit.  Prefetching improves the performance of systems following either SC or RC. Using prefetching, we assume that lock acquisition is bound to succeed and hence prefetch both the writes. When lock acquisition actually succeeds, we already have data for both the writes prefetched in our cache. Hence all writes incur a cache hit and the total number of cycles required to execute the set will be reduced to 103 cycles for both consistency models.&lt;br /&gt;
&lt;br /&gt;
A case where prefetching yields much less improvement is given in example 4.2. The read instructions contain a data dependency: the address of the third read, E[D], depends on the value returned by the second read, D. Under SC, the instructions take 302 cycles to perform. Under RC, they take 203 cycles, as reading the value of E is a cache hit using pipelining within the critical section. With prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. The reduced benefit is attributed to the data dependency between D and E (under SC) and between lock acquisition and D (under RC).  Hence prefetching improves execution time only marginally under RC in such cases.&lt;br /&gt;
&lt;br /&gt;
=='''Disadvantages of Prefetching'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
* Increased Complexity and overhead of handling the prefetching algorithms-  Performance must be improved significantly to counterbalance this overhead and complexity or the efforts are not worthwhile.&lt;br /&gt;
&lt;br /&gt;
* With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory to not only deal with regular prefetch requests but also to handle prefetch from different sources, thus greatly increasing the overhead and complexity of logic. &lt;br /&gt;
&lt;br /&gt;
* If prefetched data is stored in the data cache, then cache conflict or cache pollution, can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=='''Conclusion'''==&lt;br /&gt;
The article dealt with the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and of how prefetching can affect execution time across different consistency models. Overall, prefetching can genuinely help in lowering execution time if out-of-order execution does not take place. However, it cannot help in overlapping memory accesses; the processor will get the loads in the order determined by the program. The only difference is that it will get the data from the cache rather than from main memory because the data has been prefetched.&lt;br /&gt;
&lt;br /&gt;
=='''Quiz'''==&lt;br /&gt;
1.	In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?&lt;br /&gt;
&lt;br /&gt;
#	Yes in both models&lt;br /&gt;
#	Yes in SC, No in RC&lt;br /&gt;
#	No in SC, Yes in RC&lt;br /&gt;
#	No in both SC and RC.&lt;br /&gt;
&lt;br /&gt;
2.	___________ model categorizes the synchronization operation into Acquire and Release.&lt;br /&gt;
&lt;br /&gt;
#	Sequential Consistency&lt;br /&gt;
#	Release Consistency&lt;br /&gt;
#	Weak Ordering &lt;br /&gt;
#	Processor Consistency.&lt;br /&gt;
&lt;br /&gt;
3.	Which of these is &amp;lt;i&amp;gt;not &amp;lt;/i&amp;gt;a type of hardware implementation of a prefetcher?&lt;br /&gt;
&lt;br /&gt;
#	Predication&lt;br /&gt;
#	Stride&lt;br /&gt;
#	Stream&lt;br /&gt;
#	Adjacent Cache line prefetcher&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74646</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74646"/>
		<updated>2013-04-03T18:41:57Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Prefetching Improvements */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]&lt;br /&gt;
&lt;br /&gt;
'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
Previous articles&lt;br /&gt;
&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative and had LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
1. A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
3. An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
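&lt;br /&gt;
As a rough illustration of the stride variant in the list above, the following Python sketch keeps one table entry per load instruction and predicts the next address once the same non-zero stride has been seen twice in a row. The class name, the table layout, and the two-in-a-row confirmation rule are assumptions made for this example; real hardware uses bounded tables and more elaborate confidence tracking.&lt;br /&gt;

```python
class StridePrefetcher:
    """Toy stride detector; names and policy are illustrative assumptions."""

    def __init__(self):
        # Maps the address of a load instruction (its PC) to a pair:
        # (last data address seen, last stride seen).
        self.table = {}

    def access(self, pc, address):
        """Record one access and return an address to prefetch, or None."""
        prediction = None
        if pc in self.table:
            last_address, last_stride = self.table[pc]
            stride = address - last_address
            # Only predict once the same non-zero stride repeats.
            if stride != 0 and stride == last_stride:
                prediction = address + stride
            self.table[pc] = (address, stride)
        else:
            # First time this instruction is seen: no stride known yet.
            self.table[pc] = (address, 0)
        return prediction


if __name__ == "__main__":
    prefetcher = StridePrefetcher()
    # One load instruction walking an array with a 256-byte stride.
    for addr in (1000, 1256, 1512, 1768):
        print(addr, prefetcher.access(0x400, addr))
```

The first two accesses only train the table; from the third access onward the sketch proposes the next address in the sequence, which need not lie in a consecutive cache line.&lt;br /&gt;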
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
* Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
* Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The characteristics of the hardware prefetching are as follows :&lt;br /&gt;
&lt;br /&gt;
* Works with existing applications&lt;br /&gt;
&lt;br /&gt;
* Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
* Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Based on the previous memory access, the hardware based techniques can be employed to dynamically predict the memory address to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip Schemes: Based on the addresses required by the processor in all data references.  &lt;br /&gt;
* Off-chip Schemes: Based on the addresses that result in L1 cache misses &lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
1).''always prefetch'' - prefetch the next block on each reference &lt;br /&gt;
&lt;br /&gt;
2).''prefetch on miss'' - prefetch the next block only on a miss &lt;br /&gt;
&lt;br /&gt;
3).''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
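&lt;br /&gt;
The three OBL variants can be compared with a small simulation. The sketch below is only a model under simplifying assumptions that are not in the source: the cache is unbounded, prefetched blocks arrive instantly, and the function name and policy labels are invented for the example.&lt;br /&gt;

```python
def simulate_obl(accesses, policy):
    """Return (demand misses, prefetches issued) for one OBL policy.

    Assumes an unbounded cache and instant prefetches (illustrative only).
    """
    cache = set()        # blocks currently held in the cache
    referenced = set()   # blocks the program has referenced at least once
    misses = 0
    prefetches = 0
    for block in accesses:
        hit = block in cache
        first_reference = block not in referenced
        if not hit:
            misses += 1
            cache.add(block)
        referenced.add(block)

        if policy == "always":
            trigger = True               # prefetch on every reference
        elif policy == "on_miss":
            trigger = not hit            # prefetch only after a miss
        elif policy == "tagged":
            trigger = first_reference    # prefetch on first touch of a block
        else:
            raise ValueError("unknown policy: " + policy)

        if trigger:
            prefetches += 1
            cache.add(block + 1)         # one block of lookahead
    return misses, prefetches


if __name__ == "__main__":
    stream = [0, 0, 1, 1, 2, 2]          # sequential blocks, each touched twice
    for name in ("always", "on_miss", "tagged"):
        print(name, simulate_obl(stream, name))
```

On a stream where each block is touched twice, the tagged policy keeps the single compulsory miss of always prefetch while issuing half as many prefetch requests, and prefetch on miss issues the fewest requests at the cost of an extra miss.&lt;br /&gt;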
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns. &lt;br /&gt;
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching Improvements'''==&lt;br /&gt;
&lt;br /&gt;
Fortunately a number of techniques have evolved to address many of the concerns about prefetching, thus making it significantly more effective.&lt;br /&gt;
&lt;br /&gt;
==='''Cache Sizing'''===&lt;br /&gt;
&lt;br /&gt;
Increasing the dimensions of the cache greatly improves the performance of prefetching.  This can be done either by increasing the cache size itself or by increasing set associativity.  A small cache is a significant problem for prefetching because conflict misses increase dramatically.  Any benefit derived from having fewer cache lookup misses is negated by this higher incidence of conflict misses.  The following study assembled some benchmarks for a given SPARC (see [[#Definitions]]) processor that utilized prefetching.  The following graph shows the number of cache misses at various cache sizes.  With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, while the number of misses without prefetching is rather high.&lt;br /&gt;
&lt;br /&gt;
For example, the following study of a BioInformatics application showed the effects of prefetching cache misses based on the size of the L1 cache.  As can be seen, cache size has a major impact on the miss rate and, subsequently, performance in the system.&amp;lt;ref&amp;gt;http://www.nikolaylaptev.com/master/classes/cs254.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Cachesize.jpg|thumb|right|300px|Cache Size]]&lt;br /&gt;
&lt;br /&gt;
==='''Improved Prefetch Timing'''===&lt;br /&gt;
&lt;br /&gt;
Improving prefetch timing algorithms can also significantly improve performance and prevent data from being prefetched too late or too early to be useful.  One possibility in this case is to implement a technique called Aggressive Prefetching&amp;lt;ref&amp;gt;http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html&amp;lt;/ref&amp;gt;.  Traditionally, prefetch algorithms have only brought in incremental amounts of data for which confidence is high.  The more aggressive approach, however, seeks to search more deeply in the reference stream to bring in a wider array of data that may be used in the near future.  This does, however, require a more resource-intensive machine to avoid negatively impacting performance.&lt;br /&gt;
&lt;br /&gt;
Greater amounts of processing power and storage have made this more aggressive approach possible.  While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now.  Varying performance from a variety of devices also changes the need to be conservative in prefetching.  Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.&lt;br /&gt;
&lt;br /&gt;
==='''Memory Pattern Recognition'''===&lt;br /&gt;
&lt;br /&gt;
Effective memory pattern recognition algorithms can go a long way towards resolving issues with prefetching values unnecessarily (i.e., ones that might not be used within a sufficient timeframe).  One example is the stride technique.  For example, memory addresses that are sequential in nature, like those of an array, would all be brought in together as a unit, since more than likely most or all of these elements will be utilized.  This is an improvement over a more conservative prefetcher that may just assume memory located together will be used.  The more conservative version has been known to have a much higher incidence of bringing in unnecessary blocks of data.&lt;br /&gt;
&lt;br /&gt;
Another variation of this is the linked memory reference pattern.  Linked lists(See [[#Definitions]]) and tree structures are most often associated with this type.  Generally, occurrences of code similar to ptr = ptr-&amp;gt;next are considered as one structure and subsequently brought into memory together.&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, instead of:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
the linked memory reference pattern may work in the following way:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   prefetch(next(p))&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
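Here, the node that follows p is requested ahead of time, so it is already closer to the processor when the loop reaches it.  This is more efficient than the initial loop, because the fetch of the next node overlaps with the work done on the current one.&lt;br /&gt;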
&lt;br /&gt;
==='''Markov Prefetcher'''===&lt;br /&gt;
&lt;br /&gt;
This particular prefetch technique&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt; is built around the idea of remembering past sequences of cache misses and using that history to bring in the subsequent misses of a sequence before they occur.  Generally, a rather large table of previous miss addresses is used.  This table is maintained in a manner similar to a cache.  Sequences stored in the table are then used to predict which addresses should be prefetched next.&lt;br /&gt;
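&lt;br /&gt;
The idea of predicting the next miss from past miss sequences can be sketched in a few lines of Python. This is a minimal model, not the design from the cited paper: the class and method names are hypothetical, the table is unbounded rather than managed like a cache, and only the single most frequent successor is predicted.&lt;br /&gt;

```python
from collections import Counter, defaultdict


class MarkovPrefetcher:
    """Toy miss-history predictor; an illustrative assumption, not a real design."""

    def __init__(self):
        # For each miss address, count which miss addresses followed it.
        self.successors = defaultdict(Counter)
        self.last_miss = None

    def record_miss(self, address):
        """Learn from this miss and return the address to prefetch, or None."""
        if self.last_miss is not None:
            self.successors[self.last_miss][address] += 1
        self.last_miss = address

        followers = self.successors.get(address)
        if not followers:
            return None                  # this address has no history yet
        # Predict the successor seen most often after this address.
        return followers.most_common(1)[0][0]


if __name__ == "__main__":
    prefetcher = MarkovPrefetcher()
    for miss in (0x10, 0x20, 0x30, 0x10, 0x20, 0x40, 0x10):
        print(hex(miss), prefetcher.record_miss(miss))
```

After the sequence 0x10, 0x20, 0x30 has been seen once, a later miss on 0x10 leads the sketch to suggest 0x20, which is the behavior the table in a Markov prefetcher is meant to provide.&lt;br /&gt;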
&lt;br /&gt;
==='''Stealth Prefetching'''===&lt;br /&gt;
&lt;br /&gt;
A more modern and innovative technique, stealth prefetching&amp;lt;ref&amp;gt;http://arnetminer.org/publication/stealth-prefetching-53889.html&amp;lt;/ref&amp;gt;, attempts to deal with issues surrounding increased interconnect bandwidth usage and increased memory latency.  It also helps avoid the problem of spending bandwidth to prematurely access shared data that will later result in state downgrades.  Basically, this technique tracks memory at a coarse, region-level granularity to identify which regions are heavily used by multiple processors and which are not.  Lines that are not being shared by multiple processors are more aggressively brought nearer to the local processor, improving performance by as much as 20%.  With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.&lt;br /&gt;
&lt;br /&gt;
=='''Memory Consistency models'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
A basic requirement for the implementation of a shared-memory multiprocessor system is to correctly order the reads and writes to any memory location as seen by every processor that accesses it. This requirement is called the memory consistency requirement, and memory consistency models specify how it is met. The models apply at the machine code interface as well as at the high-level language interface, and they must be respected when high-level code is translated into the machine code that is directly executed. The consistency models therefore make the behavior of memory operations predictable. &lt;br /&gt;
&lt;br /&gt;
==='''An overview of memory consistency models'''===&lt;br /&gt;
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:&lt;br /&gt;
&lt;br /&gt;
“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”&lt;br /&gt;
&lt;br /&gt;
[[Image:2.1.jpg|thumb|right|300px| Sequential Consistency Example]]&lt;br /&gt;
&lt;br /&gt;
This model restricts the reordering of any load or store instructions and thus restricts compiler optimization techniques.  Memory consistency models that do allow operations to be reordered are called relaxed consistency models. Because reads and writes can be reordered, these models deviate from the programmer’s expectation. Relaxed consistency models include processor consistency, weak ordering, and release consistency.&lt;br /&gt;
&lt;br /&gt;
''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all earlier stores complete before a load takes place, i.e. a younger load can complete before an older store that precedes it has completed. &lt;br /&gt;
&lt;br /&gt;
''Weak ordering (WO)'' - Weak ordering groups read and write instructions into critical sections.  Reordering the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. The critical sections are separated by synchronization events, which can be implemented in the form of locks, barriers, or post-wait pairs. The requirements for the weak ordering model to work are therefore: &lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly. &lt;br /&gt;
* All read and write operations before a synchronization event have been completed. &lt;br /&gt;
* All loads and stores following a critical section cannot precede the section.&lt;br /&gt;
&lt;br /&gt;
''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, while a thread leaves the critical section after a release operation. Hence an acquire operation is not concerned with older read/write operations, as that requirement is taken care of by the release operation; similarly, a release operation is not concerned with younger reads/writes, as those are handled by the acquire operation. We can therefore rewrite the requirements for weak ordering to follow the release consistency model as follows:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.&lt;br /&gt;
* All reads and writes before a release operation should be completed.&lt;br /&gt;
* All acquire operations related to a critical section should be completed before handling a younger read or write. &lt;br /&gt;
* The acquire and release operations should be atomic with respect to each other. &lt;br /&gt;
A core difference between weak ordering and release consistency is that in release consistency overlapping of critical sections is possible.&lt;br /&gt;
Figure 3.1 shows how a sequence of instructions will be executed following a particular consistency model.  As compared to sequential consistency, relaxed consistency models provide faster execution. Now if we can prefetch instructions while conforming to the consistency model, then that will result in an even higher speed up.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching under consistency models'''&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
Prefetching can also be classified as binding and non-binding. In binding prefetching, the value of the prefetched data is locked the instant it is prefetched and will be released only when the process reads it later via a regular read. Other processes cannot modify the data during this period.  In non-binding prefetching, other processors can modify the data until the actual regular read is done. An advantage of using non-binding prefetching is that it does not affect the correctness of any consistency model.&lt;br /&gt;
Let’s see the improvement in execution time using prefetching by considering a set of instructions  given in example 4.1 and 4.2. For both sets, the cache hit latency is 1 cycle while the cache miss latency is 100 cycles. An invalidation based cache coherence scheme is used in both the examples.  &lt;br /&gt;
&lt;br /&gt;
Example 4.1:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  write A	(miss)&lt;br /&gt;
  write B	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
&lt;br /&gt;
Example 4.2:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  read C		(miss)&lt;br /&gt;
  read D		(hit)&lt;br /&gt;
  read E[D]	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
In sequential consistency (SC), reordering of operations is not allowed. Hence for example 4.1, 3 cache misses and a cache hit take a total of 301 cycles to execute. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock. Hence under RC, the total time for execution is 202 cycles as the second write is a cache hit.  Prefetching improves the performance of systems following either SC or RC. Using prefetching, we assume that lock acquisition is bound to succeed and hence prefetch both the writes. When lock acquisition actually succeeds, we already have data for both the writes prefetched in our cache. Hence all writes incur a cache hit and the total number of cycles required to execute the set will be reduced to 103 cycles for both consistency models.&lt;br /&gt;
&lt;br /&gt;
A case where prefetching helps far less is given in Example 4.2. The address of the third read, E[D], depends on the value returned by the second read, D. Under SC, the instructions take 302 cycles. Under RC, they take 203 cycles, because the reads are pipelined within the critical section. With prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. The smaller improvement is attributed to the data dependency between D and E (under SC) and between the lock acquisition and D (under RC). Hence prefetching yields little or no improvement in execution time in such cases.&lt;br /&gt;
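&lt;br /&gt;
The cycle counts quoted above can be reproduced with a small sketch. The helper names and the overlap cost model (a pipelined group costs its longest latency plus one cycle per additional access) are assumptions made for illustration; they are not taken from the cited paper. The sketch does not cover the RC-with-prefetching figure for Example 4.2.&lt;br /&gt;
&lt;br /&gt;
```python
HIT, MISS = 1, 100  # latencies in cycles, as stated in the text

def serial(latencies):
    # Sequential consistency: every access waits for the one before it.
    return sum(latencies)

def overlapped(groups):
    # Assumed cost model: accesses in one group are pipelined, so a group
    # costs its longest latency plus one cycle per additional access.
    return sum(max(g) + len(g) - 1 for g in groups)

# Example 4.1: lock L, write A, write B, unlock L
sc_41 = serial([MISS, MISS, MISS, HIT])
rc_41 = overlapped([[MISS], [MISS, MISS], [HIT]])
pf_41 = serial([MISS, HIT, HIT, HIT])  # A and B prefetched during the lock miss

# Example 4.2: lock L, read C, read D, read E[D], unlock L
sc_42 = serial([MISS, MISS, HIT, MISS, HIT])
rc_42 = overlapped([[MISS], [MISS, HIT, MISS], [HIT]])
sc_pf_42 = serial([MISS, HIT, HIT, MISS, HIT])  # E[D] cannot be prefetched before D is read

print(sc_41, rc_41, pf_41)     # 301 202 103
print(sc_42, rc_42, sc_pf_42)  # 302 203 203
```
&lt;br /&gt;
In Example 4.2 the miss on E[D] stays on the critical path under SC even with prefetching, because its address is unknown until D has been read.&lt;br /&gt;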
&lt;br /&gt;
=='''Disadvantages of Prefetching'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
* Increased complexity and overhead of handling the prefetching algorithms. Performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.&lt;br /&gt;
&lt;br /&gt;
* With multiple cores, prefetch requests can originate from many different cores. This puts additional stress on memory, which must handle not only regular requests but also prefetches from different sources, greatly increasing the overhead and complexity of the logic.&lt;br /&gt;
&lt;br /&gt;
* If prefetched data is stored in the data cache, cache conflicts or cache pollution can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=='''Conclusion'''==&lt;br /&gt;
This article covered the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and of how prefetching affects execution time under different consistency models. Overall, prefetching can genuinely help lower execution time if out-of-order execution does not take place. However, it cannot help in overlapping memory accesses; the processor still receives the loads in the order determined by the program. The only difference is that the data comes from the cache rather than from main memory because it has been prefetched.&lt;br /&gt;
&lt;br /&gt;
=='''Quiz'''==&lt;br /&gt;
1.	In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?&lt;br /&gt;
&lt;br /&gt;
#	Yes in both models&lt;br /&gt;
#	Yes in SC, No in RC&lt;br /&gt;
#	No in SC, Yes in RC&lt;br /&gt;
#	No in both SC and RC.&lt;br /&gt;
&lt;br /&gt;
2.	___________ model categorizes the synchronization operation into Acquire and Release.&lt;br /&gt;
&lt;br /&gt;
#	Sequential Consistency&lt;br /&gt;
#	Release Consistency&lt;br /&gt;
#	Weak Ordering &lt;br /&gt;
#	Processor Consistency.&lt;br /&gt;
&lt;br /&gt;
3.	Which of these is &amp;lt;i&amp;gt;not &amp;lt;/i&amp;gt;a type of hardware implementation of a prefetcher?&lt;br /&gt;
&lt;br /&gt;
#	Predication&lt;br /&gt;
#	Stride&lt;br /&gt;
#	Stream&lt;br /&gt;
#	Adjacent Cache line prefetcher&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74645</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74645"/>
		<updated>2013-04-03T18:41:09Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Memory Consistency modelshttp://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_drhttp://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]&lt;br /&gt;
&lt;br /&gt;
'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
Previous articles&lt;br /&gt;
&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidations that preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by prefetching techniques, such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but also introduces additional cost, since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
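&lt;br /&gt;
The timing trade-off above is often expressed as a prefetch distance, measured in loop iterations. The following is a minimal sketch of the usual rule of thumb; the function name and the sample numbers are assumptions chosen for illustration, not values from the cited manual.&lt;br /&gt;
&lt;br /&gt;
```python
import math

def prefetch_distance(miss_latency_cycles, cycles_per_iteration):
    # Number of loop iterations to look ahead so that the cache line
    # arrives shortly before the iteration that uses it.
    return math.ceil(miss_latency_cycles / cycles_per_iteration)

# Assumed numbers: a 100-cycle miss and 8 cycles of work per iteration.
distance = prefetch_distance(100, 8)
print(distance)  # 13
```
&lt;br /&gt;
A distance much smaller than this leaves the data still in flight when it is needed, while a much larger one raises the risk that the line is evicted before use.&lt;br /&gt;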
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
1. A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines is accessed by the program. When such a stream is found, the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
3. An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
* Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
* Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The characteristics of hardware prefetching are as follows:&lt;br /&gt;
&lt;br /&gt;
* Works with existing applications&lt;br /&gt;
&lt;br /&gt;
* Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
* Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Based on previous memory accesses, hardware-based techniques can dynamically predict the memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
1).''always prefetch'' - prefetch the next block on each reference &lt;br /&gt;
&lt;br /&gt;
2).''prefetch on miss'' - prefetch the next block only on a miss &lt;br /&gt;
&lt;br /&gt;
3).''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
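&lt;br /&gt;
The three OBL policies can be compared with a toy simulation. This is a sketch under simplifying assumptions: the cache is unbounded, prefetched blocks arrive instantly, and the function and policy names are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def simulate_obl(trace, policy):
    # trace: list of block numbers; policy: "always", "on_miss" or "tagged".
    # Assumes an unbounded cache, so only the prefetch policy affects misses.
    cache = set()
    tagged = set()  # prefetched blocks that have not been referenced yet
    misses = 0
    for block in trace:
        hit = block in cache
        first_use_of_prefetch = block in tagged
        if not hit:
            misses += 1
            cache.add(block)
        tagged.discard(block)
        if policy == "always":
            trigger = True
        elif policy == "on_miss":
            trigger = not hit
        else:  # tagged: a miss or the first reference to a prefetched block
            trigger = (not hit) or first_use_of_prefetch
        if trigger and (block + 1) not in cache:
            cache.add(block + 1)
            tagged.add(block + 1)
    return misses

sequential = list(range(8))
results = {p: simulate_obl(sequential, p) for p in ("always", "on_miss", "tagged")}
print(results)  # {'always': 1, 'on_miss': 4, 'tagged': 1}
```
&lt;br /&gt;
On a purely sequential trace, prefetch on miss still misses on every other block, while tagged prefetch matches always prefetch without issuing a prefetch on every repeated reference.&lt;br /&gt;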
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns. &lt;br /&gt;
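&lt;br /&gt;
The reference prediction table mentioned above for stride detection can be sketched as a dictionary keyed by program counter. The table layout and the rule of predicting only after the same stride is seen twice in a row are assumptions for illustration; real designs differ in their state machines and table sizes.&lt;br /&gt;
&lt;br /&gt;
```python
def stride_predictions(accesses):
    # accesses: list of (pc, address) pairs in program order.
    # Table entry per pc: (last_address, last_stride).
    table = {}
    predictions = []
    for pc, addr in accesses:
        if pc in table:
            last_addr, last_stride = table[pc]
            stride = addr - last_addr
            # Predict only once the same non-zero stride repeats.
            if stride == last_stride and stride != 0:
                predictions.append(addr + stride)
            table[pc] = (addr, stride)
        else:
            table[pc] = (addr, None)
    return predictions

# One load instruction (hypothetical pc 0x40) walking memory in 64-byte steps.
walk = [(0x40, 1000), (0x40, 1064), (0x40, 1128), (0x40, 1192)]
print(stride_predictions(walk))  # [1192, 1256]
```
&lt;br /&gt;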
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching Improvements'''==&lt;br /&gt;
&lt;br /&gt;
Fortunately a number of techniques have evolved to address many of the concerns about prefetching, thus making it significantly more effective.&lt;br /&gt;
&lt;br /&gt;
==='''Cache Sizing'''===&lt;br /&gt;
&lt;br /&gt;
Increasing the dimensions of the cache greatly improves the performance of prefetching.  This can be done either by increasing the cache size itself or by increasing set associativity.  A small cache is a significant problem for prefetching because conflict misses increase dramatically.  Any benefit derived from fewer cache lookup misses is negated by this higher incidence of conflict misses.  The following study assembled some benchmarks for a given SPARC (see [[#Definitions]]) processor that used prefetching.  The following graph shows the level of cache misses for various cache sizes.  With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, whereas the number of misses without prefetching is rather high.&lt;br /&gt;
&lt;br /&gt;
For example, the following study of a BioInformatics application showed the effects of prefetching cache misses based on the size of the L1 cache.  As can be seen, cache size has a major impact on the miss rate and, subsequently, performance in the system.&amp;lt;ref&amp;gt;http://www.nikolaylaptev.com/master/classes/cs254.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Cachesize.jpg]]&lt;br /&gt;
&lt;br /&gt;
==='''Improved Prefetch Timing'''===&lt;br /&gt;
&lt;br /&gt;
Improving prefetch timing algorithms can also significantly improve performance and prevent data from being prefetched too late or too early to be useful.  One possibility is a technique called Aggressive Prefetching&amp;lt;ref&amp;gt;http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html&amp;lt;/ref&amp;gt;.  Traditionally, prefetch algorithms have fetched only incremental amounts of data, and only with high levels of confidence.  The more aggressive approach searches more deeply in the reference stream to bring in a wider array of data that may be used in the near future.  This does, however, require a more resource-rich machine to avoid negatively impacting performance.&lt;br /&gt;
&lt;br /&gt;
Greater amounts of processing power and storage have made this more aggressive approach possible.  While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now.  Varying performance across devices also reduces the need to be conservative in prefetching.  Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.&lt;br /&gt;
&lt;br /&gt;
==='''Memory Pattern Recognition'''===&lt;br /&gt;
&lt;br /&gt;
Effective memory pattern recognition algorithms can go a long way towards avoiding unnecessary prefetches (i.e., of values that will not be used within a useful timeframe).  One example is the stride technique.  For example, memory addresses that are sequential in nature, like those of an array, would all be brought in together as a unit, since most or all of these elements are likely to be used.  This is an improvement over a more conservative prefetcher that simply assumes memory located together will be used.  The more conservative version has been known to have a much higher incidence of bringing in unnecessary blocks of data.&lt;br /&gt;
&lt;br /&gt;
Another variation of this is the linked memory reference pattern.  Linked lists (see [[#Definitions]]) and tree structures are most often associated with this type.  Generally, occurrences of code similar to ptr = ptr-&amp;gt;next are treated as traversals of one structure, whose nodes are subsequently brought into memory together.&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, instead of:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
the linked memory reference pattern may work in the following way:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   prefetch(next(p))&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Here, the node after p is requested ahead of time, so it is closer to the processor when the next iteration needs it.  This is more efficient than the initial loop.&lt;br /&gt;
&lt;br /&gt;
==='''Markov Prefetcher'''===&lt;br /&gt;
&lt;br /&gt;
This particular prefetch technique&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt; is built around the idea of remembering past sequences of cache misses and using this memory to bring in the subsequent misses in the sequence.  Generally, a rather large table containing previous miss addresses is used.  This table is maintained in a similar manner to a cache.  Sequences stored in this table are then used to predict the future addresses that should be prefetched.&lt;br /&gt;
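&lt;br /&gt;
The table described above can be sketched as a map from each miss address to the addresses that have followed it. The class and method names are hypothetical, and the table here is unbounded for simplicity, whereas the text describes a finite table managed like a cache.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter, defaultdict

class MarkovPrefetcher:
    def __init__(self):
        # For each miss address, count which miss addresses followed it.
        self.successors = defaultdict(Counter)
        self.last_miss = None

    def record_miss(self, addr):
        if self.last_miss is not None:
            self.successors[self.last_miss][addr] += 1
        self.last_miss = addr

    def predict(self, addr, width=2):
        # Return up to `width` most frequent successors as prefetch candidates.
        counts = self.successors.get(addr, Counter())
        return [a for a, _ in counts.most_common(width)]

prefetcher = MarkovPrefetcher()
for miss in [10, 20, 30, 10, 20, 40, 10, 20]:
    prefetcher.record_miss(miss)

print(prefetcher.predict(10))  # [20]
print(sorted(prefetcher.predict(20)))  # [30, 40]
```
&lt;br /&gt;
Returning several candidates per address reflects the idea that one miss can be followed by different misses on different occasions.&lt;br /&gt;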
&lt;br /&gt;
==='''Stealth Prefetching'''===&lt;br /&gt;
&lt;br /&gt;
A more modern and innovative technique, stealth prefetching&amp;lt;ref&amp;gt;http://arnetminer.org/publication/stealth-prefetching-53889.html&amp;lt;/ref&amp;gt;, attempts to deal with issues surrounding increased interconnect bandwidth and increased memory latency.  It also helps avoid the problem of using bandwidth to prematurely access shared data that will later result in state downgrades.  This technique tracks regions of memory at a coarse granularity to identify which regions are heavily used by multiple processors and which are not.  Lines that are not shared by multiple processors are more aggressively brought nearer to the local processor, improving performance by as much as 20%.  With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=='''Memory Consistency models'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
A basic requirement for a shared-memory multiprocessor system is that the reads and writes to any memory location are correctly ordered as seen by every processor that accesses it. This requirement is called the memory consistency requirement, and memory consistency models specify how it is met. The models are defined at the machine code interface as well as at the high-level language interface, and they apply when high-level code is translated into the machine code that is directly executed. The consistency models therefore make memory operations predictable.&lt;br /&gt;
&lt;br /&gt;
==='''An overview of memory consistency models'''===&lt;br /&gt;
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:&lt;br /&gt;
&lt;br /&gt;
“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”&lt;br /&gt;
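&lt;br /&gt;
Lamport's definition can be checked by brute force on a tiny program. The sketch below enumerates every interleaving that keeps each processor's program order; the two-processor program (each stores to one variable, then loads the other) is an illustrative example that does not come from the article.&lt;br /&gt;
&lt;br /&gt;
```python
def interleavings(a, b):
    # All merges of sequences a and b that keep each sequence's own order.
    if not a:
        return [list(b)]
    if not b:
        return [list(a)]
    return ([[a[0]] + rest for rest in interleavings(a[1:], b)] +
            [[b[0]] + rest for rest in interleavings(a, b[1:])])

def run(schedule):
    memory = {"x": 0, "y": 0}
    regs = {}
    for op, target, source in schedule:
        if op == "store":
            memory[target] = source        # source is the constant to store
        else:
            regs[target] = memory[source]  # load memory[source] into a register
    return regs["r1"], regs["r2"]

p1 = [("store", "x", 1), ("load", "r1", "y")]
p2 = [("store", "y", 1), ("load", "r2", "x")]
outcomes = {run(s) for s in interleavings(p1, p2)}
print(sorted(outcomes))  # [(0, 1), (1, 0), (1, 1)]
```
&lt;br /&gt;
The result (0, 0) never appears, so a machine that produces it for this program is not sequentially consistent.&lt;br /&gt;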
&lt;br /&gt;
[[Image:2.1.jpg|thumb|right|300px| Sequential Consistency Example]]&lt;br /&gt;
&lt;br /&gt;
This model restricts the reordering of any load or store instructions and thus restricts compiler optimization techniques.  Models that do allow operations to be reordered are called relaxed consistency models. Because reads and writes can be reordered, these models deviate from the programmer’s expectation. Relaxed consistency models include processor consistency, weak ordering, and release consistency.&lt;br /&gt;
&lt;br /&gt;
''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all earlier stores complete before a load takes place, i.e. a younger load can complete before an older store that precedes it has completed.&lt;br /&gt;
&lt;br /&gt;
''Weak ordering (WO)'' - Weak ordering separates all read and write instructions into critical sections.  Reordering the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. These critical sections are separated by synchronization events, which can be implemented as locks, barriers, or post-wait pairs. The requirements for the weak ordering model to work are therefore:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly. &lt;br /&gt;
* All read and write operations before a synchronization event have been completed. &lt;br /&gt;
* All loads and stores following a critical section cannot precede the section.&lt;br /&gt;
&lt;br /&gt;
''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, and a thread leaves the critical section with a release operation. Hence an acquire operation is not concerned with older reads and writes, as that requirement is taken care of by the release operation; similarly, a release operation is not concerned with younger reads and writes, as those are handled by the acquire operation. Hence we can rewrite the requirements for weak ordering to follow the release consistency model as follows:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.&lt;br /&gt;
* All reads and writes before a release operation should be completed.&lt;br /&gt;
* All acquire operations related to a critical section should be completed before any younger read or write is handled.&lt;br /&gt;
* The acquire and release operations should be atomic with respect to each other.&lt;br /&gt;
A core difference between weak ordering and release consistency is that in release consistency overlapping of critical sections is possible.&lt;br /&gt;
Figure 3.1 shows how a sequence of instructions will be executed following a particular consistency model.  As compared to sequential consistency, relaxed consistency models provide faster execution. Now if we can prefetch instructions while conforming to the consistency model, then that will result in an even higher speed up.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching under consistency models'''&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
Prefetching can also be classified as binding or non-binding. In binding prefetching, the value of the prefetched data is bound at the instant it is prefetched and is released only when the process later reads it via a regular read. Other processors cannot modify the data during this period. In non-binding prefetching, other processors can modify the data until the regular read is actually performed. An advantage of non-binding prefetching is that it does not affect the correctness of any consistency model.&lt;br /&gt;
To see how prefetching improves execution time, consider the instruction sequences given in Examples 4.1 and 4.2. For both, a cache hit takes 1 cycle and a cache miss takes 100 cycles, and an invalidation-based cache coherence scheme is used.&lt;br /&gt;
&lt;br /&gt;
Example 4.1:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  write A	(miss)&lt;br /&gt;
  write B	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
&lt;br /&gt;
Example 4.2:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  read C		(miss)&lt;br /&gt;
  read D		(hit)&lt;br /&gt;
  read E[D]	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
Sequential consistency (SC) does not allow operations to be reordered. Hence, in Example 4.1, three cache misses and one cache hit take a total of 301 cycles. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock, so the second write overlaps with the first and the total execution time is 202 cycles. Prefetching improves performance under both SC and RC. We assume that the lock acquisition will succeed and prefetch the data for both writes. When the lock is actually acquired, the data for both writes is already in the cache. Both writes therefore hit, and the total drops to 103 cycles under both consistency models.&lt;br /&gt;
&lt;br /&gt;
A case where prefetching helps far less is given in Example 4.2. The address of the third read, E[D], depends on the value returned by the second read, D. Under SC, the instructions take 302 cycles. Under RC, they take 203 cycles, because the reads are pipelined within the critical section. With prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. The smaller improvement is attributed to the data dependency between D and E (under SC) and between the lock acquisition and D (under RC). Hence prefetching yields little or no improvement in execution time in such cases.&lt;br /&gt;
&lt;br /&gt;
=='''Disadvantages of Prefetching'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
* Increased complexity and overhead of handling the prefetching algorithms. Performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.&lt;br /&gt;
&lt;br /&gt;
* With multiple cores, prefetch requests can originate from many different cores. This puts additional stress on memory, which must handle not only regular requests but also prefetches from different sources, greatly increasing the overhead and complexity of the logic.&lt;br /&gt;
&lt;br /&gt;
* If prefetched data is stored in the data cache, cache conflicts or cache pollution can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=='''Conclusion'''==&lt;br /&gt;
This article covered the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and of how prefetching affects execution time under different consistency models. Overall, prefetching can genuinely help lower execution time if out-of-order execution does not take place. However, it cannot help in overlapping memory accesses; the processor still receives the loads in the order determined by the program. The only difference is that the data comes from the cache rather than from main memory because it has been prefetched.&lt;br /&gt;
&lt;br /&gt;
=='''Quiz'''==&lt;br /&gt;
1.	In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?&lt;br /&gt;
&lt;br /&gt;
#	Yes in both models&lt;br /&gt;
#	Yes in SC, No in RC&lt;br /&gt;
#	No in SC, Yes in RC&lt;br /&gt;
#	No in both SC and RC.&lt;br /&gt;
&lt;br /&gt;
2.	___________ model categorizes the synchronization operation into Acquire and Release.&lt;br /&gt;
&lt;br /&gt;
#	Sequential Consistency&lt;br /&gt;
#	Release Consistency&lt;br /&gt;
#	Weak Ordering &lt;br /&gt;
#	Processor Consistency.&lt;br /&gt;
&lt;br /&gt;
3.	Which of these is &amp;lt;i&amp;gt;not &amp;lt;/i&amp;gt;a type of hardware implementation of a prefetcher?&lt;br /&gt;
&lt;br /&gt;
#	Predication&lt;br /&gt;
#	Stride&lt;br /&gt;
#	Stream&lt;br /&gt;
#	Adjacent Cache line prefetcher&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74644</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74644"/>
		<updated>2013-04-03T18:40:10Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* An overview of memory consistency models */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]&lt;br /&gt;
&lt;br /&gt;
'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
Previous articles&lt;br /&gt;
&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidations that preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by prefetching techniques, such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line may already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but also introduces additional cost, since the cache line is now fetched twice from main memory or the next cache level, increasing the memory bandwidth requirement of the program.&lt;br /&gt;
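&lt;br /&gt;
For loops, the timing trade-off is often reduced to a rule of thumb: issue the prefetch a fixed number of iterations ahead, where that distance is the miss latency divided by the time of one loop iteration, rounded up. The sketch below only performs this calculation; the function name and the cycle counts are illustrative assumptions, not figures from the cited manual.&lt;br /&gt;

```python
import math

def prefetch_distance(miss_latency_cycles, cycles_per_iteration):
    """Iterations ahead at which to prefetch so the line arrives just in time."""
    if cycles_per_iteration == 0:
        raise ValueError("loop body must take at least one cycle")
    return math.ceil(miss_latency_cycles / cycles_per_iteration)

# Assumed numbers: a 100-cycle miss and a loop body that takes 30 cycles.
print(prefetch_distance(100, 30))
```

A distance smaller than this leaves part of the miss latency exposed, while a much larger one raises the risk that the line is evicted before it is used.&lt;br /&gt;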
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
1. A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines is accessed by the program. When such a stream is found, the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, which do not necessarily have to be to consecutive cache lines. When such an instruction is detected, the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
3. An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches the cache lines adjacent to the ones being accessed by the program. This can be used to mimic the behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
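&lt;br /&gt;
The stride variant can be sketched in a few lines. The model below is a simplified assumption rather than a description of any particular processor: it keeps one table entry per load instruction, keyed by a hypothetical program counter value, and predicts the next address once the same non-zero stride has been observed twice in a row. The class name and the addresses are made up for the example.&lt;br /&gt;

```python
class StridePrefetcher:
    """Toy stride detector: one table entry per load instruction (keyed by PC)."""

    def __init__(self):
        self.table = {}  # pc: (last_address, last_stride)

    def access(self, pc, address):
        """Record an access; return the address to prefetch, or None."""
        prediction = None
        if pc in self.table:
            last_address, last_stride = self.table[pc]
            stride = address - last_address
            # Predict only once the same non-zero stride has been seen twice.
            if stride != 0 and stride == last_stride:
                prediction = address + stride
            self.table[pc] = (address, stride)
        else:
            self.table[pc] = (address, 0)
        return prediction

prefetcher = StridePrefetcher()
# Hypothetical load at PC 0x40 walking memory with a 256-byte stride.
predictions = [prefetcher.access(0x40, a) for a in (1000, 1256, 1512, 1768)]
print(predictions)
```

The first two accesses only train the table; from the third access onward the detector proposes the next address in the sequence.&lt;br /&gt;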
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
* Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
* Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Works with existing applications&lt;br /&gt;
&lt;br /&gt;
* Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
* Incurs a start-up penalty before the hardware prefetcher triggers, and issues extra fetches after the end of an array is reached.&lt;br /&gt;
&lt;br /&gt;
* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Hardware-based techniques use the history of previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses requested by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
1. ''always prefetch'' - prefetch the next block on each reference.&lt;br /&gt;
&lt;br /&gt;
2. ''prefetch on miss'' - prefetch the next block only on a miss.&lt;br /&gt;
&lt;br /&gt;
3. ''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time.&lt;br /&gt;
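&lt;br /&gt;
The difference between the three OBL variants shows up clearly on a sequential scan. The following sketch counts demand misses under each policy; it assumes an unbounded cache and prefetches that arrive instantly, and the function and policy names are invented for the illustration.&lt;br /&gt;

```python
def count_misses(references, policy):
    """Count demand misses for a One Block Lookahead policy.

    Simplifying assumptions: unbounded cache, prefetches arrive instantly.
    """
    cache = set()       # blocks currently in the cache
    referenced = set()  # blocks the program has touched at least once
    misses = 0
    for block in references:
        miss = block not in cache
        first_touch = block not in referenced
        if miss:
            misses += 1
            cache.add(block)
        referenced.add(block)
        if policy == "always":
            do_prefetch = True
        elif policy == "on_miss":
            do_prefetch = miss
        elif policy == "tagged":
            do_prefetch = first_touch
        else:
            raise ValueError("unknown policy: " + policy)
        if do_prefetch:
            cache.add(block + 1)  # One Block Lookahead: fetch the next block
    return misses

stream = list(range(8))  # a purely sequential scan of blocks 0..7
for policy in ("always", "on_miss", "tagged"):
    print(policy, count_misses(stream, policy))
```

Under these assumptions prefetch on miss still misses on every other block of the scan, whereas always prefetch and tagged prefetch miss only on the first block; tagged prefetch achieves this without issuing a prefetch on repeated references to the same block.&lt;br /&gt;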
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per-stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.&lt;br /&gt;
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads, so support for them is not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques, as the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching Improvements'''==&lt;br /&gt;
&lt;br /&gt;
Fortunately, a number of techniques have evolved to address many of the concerns about prefetching, making it significantly more effective.&lt;br /&gt;
&lt;br /&gt;
==='''Cache Sizing'''===&lt;br /&gt;
&lt;br /&gt;
Increasing the dimensions of the cache greatly improves the performance of prefetching.  This can be done either by increasing the cache size itself or by increasing set associativity.  A small cache is a significant problem for prefetching because conflict misses increase dramatically.  Any benefit derived from having fewer compulsory and capacity misses is negated by this higher incidence of conflict misses.  The following study assembled some benchmarks for a given SPARC (see [[#Definitions]]) processor that utilized prefetching.  The following graph shows the level of cache misses for various cache sizes.  With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, whereas the number of misses without prefetching is rather high.&lt;br /&gt;
&lt;br /&gt;
For example, the following study of a bioinformatics application showed the effect of prefetching on cache misses as a function of the size of the L1 cache.  As can be seen, cache size has a major impact on the miss rate and, consequently, on the performance of the system.&amp;lt;ref&amp;gt;http://www.nikolaylaptev.com/master/classes/cs254.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Cachesize.jpg]]&lt;br /&gt;
&lt;br /&gt;
==='''Improved Prefetch Timing'''===&lt;br /&gt;
&lt;br /&gt;
Improving prefetch timing algorithms can also significantly improve performance and prevent data from being prefetched too late or too early to be useful.  One possibility is a technique called Aggressive Prefetching&amp;lt;ref&amp;gt;http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html&amp;lt;/ref&amp;gt;.  Traditionally, prefetch algorithms have brought in only incremental amounts of data, and only when confidence that the data will be used is high.  The more aggressive approach, however, searches more deeply in the reference stream to bring in a wider array of data that may be used in the near future.  This does, however, require a more resource-rich machine to avoid negatively impacting performance.&lt;br /&gt;
&lt;br /&gt;
Greater amounts of processing power and storage have made this more aggressive approach possible.  While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now.  Varying performance across a variety of devices also reduces the need to be conservative in prefetching.  Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.&lt;br /&gt;
&lt;br /&gt;
==='''Memory Pattern Recognition'''===&lt;br /&gt;
&lt;br /&gt;
Effective memory pattern recognition algorithms can go a long way towards avoiding unnecessary prefetches (i.e., of values that will not be used within a useful timeframe).  One example is the stride technique.  Memory addresses that are sequential in nature, like those of an array, are brought in together as a unit since most or all of these elements are likely to be used.  This is an improvement over a more conservative prefetcher that simply assumes memory located together will be used together.  The more conservative version has been known to have a much higher incidence of bringing in unnecessary blocks of data.&lt;br /&gt;
&lt;br /&gt;
Another variation of this is the linked memory reference pattern.  Linked lists (see [[#Definitions]]) and tree structures are most often associated with this type.  Generally, occurrences of code similar to ptr = ptr-&amp;gt;next are treated as references to one structure, whose elements are subsequently brought in together.&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, instead of:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
the linked memory reference pattern may work in the following way:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   prefetch(next(p))&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Here, the node that follows p is requested before work(p) starts, so it is already closer to the processor when the loop advances to it.  This is more efficient than the initial loop.&lt;br /&gt;
&lt;br /&gt;
==='''Markov Prefetcher'''===&lt;br /&gt;
&lt;br /&gt;
This particular prefetch technique&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt; is built around the idea of remembering past sequences of cache misses and using this history to bring in the subsequent misses in a sequence ahead of time.  Generally, a rather large table holding previous miss addresses is used.  This table is maintained in a manner similar to a cache.  Sequences stored in the table are then used to predict which addresses are likely to miss next, so that they can be prefetched.&lt;br /&gt;
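&lt;br /&gt;
A toy version of such a table is sketched below. It is an assumption-laden simplification: the table is unbounded rather than managed like a cache, only the most frequent successor of each miss address is predicted, and the class name and the symbolic addresses are invented for the example.&lt;br /&gt;

```python
from collections import Counter, defaultdict

class MarkovPrefetcher:
    """Toy Markov predictor: remembers which miss address followed which."""

    def __init__(self):
        # miss address: Counter of the miss addresses that followed it
        self.successors = defaultdict(Counter)
        self.previous_miss = None

    def record_miss(self, address):
        """Update the table with a new miss and return the predicted next miss."""
        if self.previous_miss is not None:
            self.successors[self.previous_miss][address] += 1
        self.previous_miss = address
        candidates = self.successors.get(address)
        if not candidates:
            return None  # this address has never been followed by a miss
        return candidates.most_common(1)[0][0]

prefetcher = MarkovPrefetcher()
# Hypothetical miss stream in which the pattern A, B, C repeats.
predictions = [prefetcher.record_miss(a) for a in ["A", "B", "C", "A", "B", "C"]]
print(predictions)
```

No prediction is possible during the first pass over the pattern; on the second pass each miss yields the address that followed it previously, which is the candidate to prefetch.&lt;br /&gt;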
&lt;br /&gt;
==='''Stealth Prefetching'''===&lt;br /&gt;
&lt;br /&gt;
A more modern and innovative technique, stealth prefetching&amp;lt;ref&amp;gt;http://arnetminer.org/publication/stealth-prefetching-53889.html&amp;lt;/ref&amp;gt;, attempts to deal with issues surrounding increased interconnect traffic and increased memory latency.  It also helps avoid the problem of using bandwidth to prematurely access shared data that will later result in state downgrades.  Basically, this technique tracks memory at a coarse granularity, by region, to identify which regions are heavily used by multiple processors and which are not.  Lines that are not being shared by multiple processors are more aggressively brought nearer to the local processor, improving performance by as much as 20%.  With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=='''Memory Consistency models'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
A basic requirement of a shared-memory multiprocessor system is that the reads and writes to any memory location be seen in a correct order by every processor that accesses it. This is called the memory consistency requirement, and memory consistency models specify how it is met. The models apply at the machine code interface as well as at the high-level language interface, and they must be respected when high-level code is translated into the machine code that is directly executed. The consistency models thereby make the behavior of memory operations predictable.&lt;br /&gt;
&lt;br /&gt;
==='''An overview of memory consistency models'''===&lt;br /&gt;
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:&lt;br /&gt;
&lt;br /&gt;
“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”&lt;br /&gt;
&lt;br /&gt;
This model prohibits the reordering of any load or store instructions and thus restricts compiler optimization techniques.  Memory consistency models that do allow operations to be reordered are called relaxed consistency models. Because reads and writes can be reordered, these models deviate from the programmer’s expectation. Relaxed consistency models include processor consistency, weak ordering, and release consistency.&lt;br /&gt;
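&lt;br /&gt;
Lamport's definition can be checked mechanically for a small program. The sketch below enumerates every interleaving of two hypothetical two-instruction threads that preserves each thread's program order and collects the values the two reads can return. The thread programs, the variable names x and y (both initially 0) and the function name are assumptions made for this illustration.&lt;br /&gt;

```python
from itertools import permutations

# Two hypothetical threads; each operation is (kind, variable).
PROGRAMS = {
    0: [("write", "x"), ("read", "y")],
    1: [("write", "y"), ("read", "x")],
}

def sc_outcomes():
    """Return every (r0, r1) result reachable under sequential consistency."""
    slots = [0, 0, 1, 1]  # which thread executes at each step
    outcomes = set()
    for order in set(permutations(slots)):
        memory = {"x": 0, "y": 0}
        next_op = {0: 0, 1: 0}  # per-thread program counter keeps program order
        result = {}
        for thread in order:
            kind, var = PROGRAMS[thread][next_op[thread]]
            next_op[thread] += 1
            if kind == "write":
                memory[var] = 1
            else:
                result[thread] = memory[var]
        outcomes.add((result[0], result[1]))
    return outcomes

print(sorted(sc_outcomes()))
```

The outcome in which both reads return 0 never appears, because some write always comes first in any sequential order. That outcome becomes possible once a younger load is allowed to complete before an older store, which is the kind of reordering the relaxed models permit.&lt;br /&gt;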
&lt;br /&gt;
''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all older stores complete before a load takes place, i.e., a younger load can complete before the store preceding it has completed.&lt;br /&gt;
&lt;br /&gt;
''Weak ordering (WO)'' – Weak ordering groups read and write instructions into critical sections.  Reordering the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. The critical sections are separated by synchronization events, which can be implemented in the form of locks, barriers, or post-wait pairs. The requirements for the weak ordering model to work are therefore:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical sections explicitly.&lt;br /&gt;
* All read and write operations before a synchronization event must have completed.&lt;br /&gt;
* Loads and stores that follow a critical section cannot be performed ahead of it.&lt;br /&gt;
&lt;br /&gt;
''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, while a release operation marks the thread leaving it. Hence an acquire operation is not concerned with older reads and writes, as that requirement is taken care of by the release operation; similarly, a release operation is not concerned with younger reads and writes, as those are handled by the acquire operation. We can therefore rewrite the requirements of weak ordering for the release consistency model as follows:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.&lt;br /&gt;
* All reads and writes before a release operation must have completed.&lt;br /&gt;
* An acquire operation for a critical section must complete before any younger read or write is performed.&lt;br /&gt;
* The acquire and release operations should be atomic with respect to each other.&lt;br /&gt;
A core difference between weak ordering and release consistency is that in release consistency overlapping of critical sections is possible.&lt;br /&gt;
Figure 3.1 shows how a sequence of instructions is executed under a particular consistency model.  Compared to sequential consistency, relaxed consistency models provide faster execution. If we can also prefetch while conforming to the consistency model, the speedup will be even higher.&lt;br /&gt;
[[Image:2.1.jpg|thumb|right|300px|Figure 3.1. Sequential Consistency Example]]&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching under consistency models'''&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
Prefetching can also be classified as binding or non-binding. In binding prefetching, the value of the prefetched data is bound the instant it is prefetched and is released only when the processor reads it later via a regular read. Other processors cannot modify the data during this period.  In non-binding prefetching, other processors can modify the data until the regular read is actually performed. An advantage of non-binding prefetching is that it does not affect the correctness of any consistency model.&lt;br /&gt;
To see how prefetching improves execution time, consider the instruction sequences given in examples 4.1 and 4.2. In both, the cache hit latency is 1 cycle and the cache miss latency is 100 cycles, and an invalidation-based cache coherence scheme is used.&lt;br /&gt;
&lt;br /&gt;
Example 4.1:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  write A	(miss)&lt;br /&gt;
  write B	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
&lt;br /&gt;
Example 4.2:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  read C		(miss)&lt;br /&gt;
  read D		(hit)&lt;br /&gt;
  read E[D]	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
In sequential consistency (SC), reordering of operations is not allowed. Hence for example 4.1, three cache misses and a cache hit take a total of 301 cycles to execute. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock. Hence under RC the total execution time is 202 cycles, since the second write overlaps with the first and adds only one cycle.  Prefetching improves the performance of systems following either SC or RC. With prefetching, we assume that lock acquisition is bound to succeed and hence prefetch the lines for both writes. When lock acquisition actually succeeds, the data for both writes is already in our cache. Hence both writes incur a cache hit and the total number of cycles required to execute the sequence is reduced to 103 for both consistency models.&lt;br /&gt;
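&lt;br /&gt;
The cycle counts quoted for example 4.1 can be reproduced with simple arithmetic. The breakdown below is a sketch under the stated latencies (1 cycle for a hit, 100 cycles for a miss); it assumes that an overlapped or prefetched write costs one cycle once the access it overlaps with has been paid for, and the variable names are chosen for the illustration.&lt;br /&gt;

```python
HIT, MISS = 1, 100  # latencies from the text, in cycles

# Example 4.1: lock L (miss), write A (miss), write B (miss), unlock L (hit).
sequential = MISS + MISS + MISS + HIT   # SC: every access waits for the previous one
relaxed = MISS + MISS + HIT + HIT       # RC: write B overlaps write A, adding one cycle
with_prefetch = MISS + HIT + HIT + HIT  # both writes prefetched during the lock miss

print(sequential, relaxed, with_prefetch)
```

The three sums correspond to the 301, 202 and 103 cycles discussed above.&lt;br /&gt;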
&lt;br /&gt;
A case where prefetching is less effective is given in example 4.2. The read instructions contain a data dependency: the address of the third read depends on the value returned by the second. Under SC, the instructions take 302 cycles to perform. Under RC, they take 203 cycles, since the reads are pipelined within the critical section. With prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. The limited improvement is attributed to the dependency between D and E (under SC) and between lock acquisition and D (under RC).  Hence prefetching cannot fully hide the miss latency in such cases.&lt;br /&gt;
&lt;br /&gt;
=='''Disadvantages of Prefetching'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
* Increased complexity and overhead of handling the prefetching algorithms. Performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.&lt;br /&gt;
&lt;br /&gt;
* With multiple cores, prefetch requests can originate from a variety of different cores. This puts additional stress on memory, which must not only deal with regular prefetch requests but also handle prefetches from different sources, greatly increasing the overhead and complexity of the logic.&lt;br /&gt;
&lt;br /&gt;
* If prefetched data is stored in the data cache, then cache conflicts, or cache pollution, can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=='''Conclusion'''==&lt;br /&gt;
This article covered the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and showed how prefetching affects execution time under different consistency models. Overall, prefetching can genuinely help lower execution time when out-of-order execution does not take place. However, it cannot help overlap memory accesses; the processor will still receive the loads in the order determined by the program. The only difference is that it gets the data from the cache rather than from main memory because the data has been prefetched.&lt;br /&gt;
&lt;br /&gt;
=='''Quiz'''==&lt;br /&gt;
1.	In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?&lt;br /&gt;
&lt;br /&gt;
#	Yes in both models&lt;br /&gt;
#	Yes in SC, No in RC&lt;br /&gt;
#	No in SC, Yes in RC&lt;br /&gt;
#	No in both SC and RC.&lt;br /&gt;
&lt;br /&gt;
2.	___________ model categorizes the synchronization operation into Acquire and Release.&lt;br /&gt;
&lt;br /&gt;
#	Sequential Consistency&lt;br /&gt;
#	Release Consistency&lt;br /&gt;
#	Weak Ordering &lt;br /&gt;
#	Processor Consistency.&lt;br /&gt;
&lt;br /&gt;
3.	Which of these is &amp;lt;i&amp;gt;not &amp;lt;/i&amp;gt;a type of hardware implementation of a prefetcher?&lt;br /&gt;
&lt;br /&gt;
#	Predication&lt;br /&gt;
#	Stride&lt;br /&gt;
#	Stream&lt;br /&gt;
#	Adjacent Cache line prefetcher&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74642</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74642"/>
		<updated>2013-04-03T18:37:26Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* &amp;quot;Disadvantages of Prefetching&amp;quot;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]&lt;br /&gt;
&lt;br /&gt;
'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
Previous articles&lt;br /&gt;
&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache is not large enough to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidations issued to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by prefetching, either implicitly by increasing the size of the cache lines or explicitly by fetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line may already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but also introduces additional cost, since the cache line is now fetched twice from main memory or the next cache level, increasing the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
1. A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines is accessed by the program. When such a stream is found, the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, which do not necessarily have to be to consecutive cache lines. When such an instruction is detected, the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
3. An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches the cache lines adjacent to the ones being accessed by the program. This can be used to mimic the behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
* Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
* Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Works with existing applications&lt;br /&gt;
&lt;br /&gt;
* Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
* Incurs a start-up penalty before the hardware prefetcher triggers, and issues extra fetches after the end of an array is reached.&lt;br /&gt;
&lt;br /&gt;
* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Hardware-based techniques use the history of previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses requested by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
1. ''always prefetch'' - prefetch the next block on each reference.&lt;br /&gt;
&lt;br /&gt;
2. ''prefetch on miss'' - prefetch the next block only on a miss.&lt;br /&gt;
&lt;br /&gt;
3. ''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time.&lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per-stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns. &lt;br /&gt;
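&lt;br /&gt;
The reference prediction table mentioned above can be sketched as follows. The sketch assumes a table keyed by program counter that stores the last address and last stride, and it issues a prefetch once the same non-zero stride is seen twice in a row; the function name and this confirmation rule are assumptions made for illustration.&lt;br /&gt;

```python
def stride_prefetches(trace):
    """Return the addresses a simple PC-indexed stride detector would prefetch.

    `trace` is a list of (pc, address) pairs. Each table entry remembers the
    last address and the last stride observed for that pc.
    """
    table = {}     # maps pc to (last_address, last_stride)
    issued = []
    for pc, address in trace:
        if pc in table:
            last_address, last_stride = table[pc]
            stride = address - last_address
            # Prefetch only when the stride repeats and is non-zero.
            if stride != 0 and stride == last_stride:
                issued.append(address + stride)
            table[pc] = (address, stride)
        else:
            table[pc] = (address, 0)
    return issued


# One load instruction walking an array in 64-byte steps.
regular = [(0x400, 1000 + 64 * i) for i in range(5)]
print(stride_prefetches(regular))

# An irregular pattern never confirms a stride, so nothing is prefetched.
irregular = [(0x400, 10), (0x400, 50), (0x400, 20)]
print(stride_prefetches(irregular))
```

The first two accesses only train the table entry, which reflects the start-up penalty before a hardware prefetcher triggers.&lt;br /&gt;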
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads, so support for them is not worth implementing. History-based prefetching suffers from low predictive accuracy and from the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques, as the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching Improvements'''==&lt;br /&gt;
&lt;br /&gt;
Fortunately a number of techniques have evolved to address many of the concerns about prefetching, thus making it significantly more effective.&lt;br /&gt;
&lt;br /&gt;
==='''Cache Sizing'''===&lt;br /&gt;
&lt;br /&gt;
Increasing the dimensions of the cache greatly improves the performance of prefetching.  This can be done either by increasing the cache size itself or by increasing set associativity.  A small cache is a significant problem for prefetching because conflict misses increase dramatically: any benefit derived from having fewer cache misses is negated by the higher incidence of conflict misses.  One study assembled benchmarks for a SPARC (See [[#Definitions]]) processor that utilized prefetching.  The graph below shows the number of cache misses for various cache sizes.  With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, while the number of misses without prefetching is rather high.&lt;br /&gt;
&lt;br /&gt;
For example, the following study of a bioinformatics application showed how the size of the L1 cache affects cache misses under prefetching.  As can be seen, cache size has a major impact on the miss rate and, subsequently, on the performance of the system.&amp;lt;ref&amp;gt;http://www.nikolaylaptev.com/master/classes/cs254.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Cachesize.jpg]]&lt;br /&gt;
&lt;br /&gt;
==='''Improved Prefetch Timing'''===&lt;br /&gt;
&lt;br /&gt;
Improving prefetch timing algorithms can also significantly improve performance and prevent data from being prefetched too late or too early to be useful.  One possibility in this case is to implement a technique called Aggressive Prefetching&amp;lt;ref&amp;gt;http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html&amp;lt;/ref&amp;gt;.  Traditionally, prefetch algorithms have brought in only incremental amounts of data with high levels of confidence.  The more aggressive approach, however, searches more deeply in the reference stream to bring in a wider array of data that may be used in the near future.  This does, however, require a more resource-rich machine to avoid negatively impacting performance.&lt;br /&gt;
&lt;br /&gt;
Greater amounts of processing power and storage have made this more aggressive approach possible.  While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now.  Varying performance across a variety of devices also reduces the need to be conservative in prefetching.  Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.&lt;br /&gt;
&lt;br /&gt;
==='''Memory Pattern Recognition'''===&lt;br /&gt;
&lt;br /&gt;
Effective memory pattern recognition algorithms can go a long way towards resolving issues with prefetching values unnecessarily (i.e., ones that might not be used within a useful timeframe).  One example is the stride technique.  Memory addresses that are sequential in nature, such as those of an array, would all be brought in together as a unit, since most or all of these elements will likely be utilized.  This is an improvement over a more conservative prefetcher that may just assume memory located together will be used.  The more conservative version has been known to have a much higher incidence of bringing in unnecessary blocks of data.&lt;br /&gt;
&lt;br /&gt;
Another variation of this is the linked memory reference pattern.  Linked lists(See [[#Definitions]]) and tree structures are most often associated with this type.  Generally, occurrences of code similar to ptr = ptr-&amp;gt;next are considered as one structure and subsequently brought into memory together.&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, instead of:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
the linked memory reference pattern may work in the following way:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   prefetch(next(p))&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Here, the node that follows p is requested ahead of time, so it is already closer to the processor when the loop advances to it.  Provided the prefetch completes while work(p) is running, this is more efficient than the initial loop.&lt;br /&gt;
&lt;br /&gt;
==='''Markov Prefetcher'''===&lt;br /&gt;
&lt;br /&gt;
This particular prefetch technique&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt; is built around the idea of remembering past sequences of cache misses and using this history to bring in the subsequent misses in a sequence.  Generally, a rather large table containing previous miss addresses is used.  This table is maintained in a manner similar to a cache.  Sequences stored in this table are then used to predict future miss sequences, which are prefetched.&lt;br /&gt;
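&lt;br /&gt;
A minimal sketch of such a miss-history table is shown below. The class name MarkovPrefetcher, the method record_miss and the unbounded dictionary are assumptions made for illustration; a real implementation would bound the table and replace entries much like a cache does.&lt;br /&gt;

```python
from collections import Counter, defaultdict


class MarkovPrefetcher:
    """Predict the next miss address from previously observed miss sequences."""

    def __init__(self):
        # For each miss address, count which miss addresses followed it.
        self.successors = defaultdict(Counter)
        self.last_miss = None

    def record_miss(self, address):
        """Record a miss; return the most likely next miss, or None if unknown."""
        if self.last_miss is not None:
            self.successors[self.last_miss][address] += 1
        self.last_miss = address
        followers = self.successors.get(address)
        if not followers:
            return None
        # The most frequent follower is the prefetch candidate.
        return followers.most_common(1)[0][0]


prefetcher = MarkovPrefetcher()
miss_stream = [10, 20, 30, 10, 20, 30]
predictions = [prefetcher.record_miss(a) for a in miss_stream]
print(predictions)
```

The first pass through the sequence only fills the table; on the second pass each miss yields a prediction for the miss that followed it earlier.&lt;br /&gt;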
&lt;br /&gt;
==='''Stealth Prefetching'''===&lt;br /&gt;
&lt;br /&gt;
A more modern and innovative technique, stealth prefetching&amp;lt;ref&amp;gt;http://arnetminer.org/publication/stealth-prefetching-53889.html&amp;lt;/ref&amp;gt;, attempts to deal with issues surrounding increasing interconnect traffic and memory latency.  It also helps to avoid the problem of using bandwidth to prematurely access shared data that will later result in state downgrades.  This technique tracks memory at a coarse, region-level granularity to identify which regions are heavily used by multiple processors and which ones are not.  Lines that are not being shared by multiple processors are more aggressively brought nearer to the local processor, thus improving performance by as much as 20%.  With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=='''Memory Consistency models'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
A basic requirement for the implementation of a shared-memory multiprocessor system is to correctly order the reads and writes to any memory location as seen by any processor attempting to access it. This requirement is called the memory consistency requirement, and memory consistency models adhere to it. The models apply at the machine code interface as well as at the high-level language interface, and they must be respected when translating high-level code to the machine code that is directly executed. The consistency models therefore make memory operations predictable.&lt;br /&gt;
&lt;br /&gt;
==='''An overview of memory consistency models'''===&lt;br /&gt;
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:&lt;br /&gt;
&lt;br /&gt;
“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”&lt;br /&gt;
&lt;br /&gt;
This model restricts the reordering of any read or store instructions and thus restricts compiler optimization techniques.  A memory consistency model that allows the reordering of operations is called a relaxed consistency model. Because reads and writes can be reordered, such models deviate from the programmer’s expectation. Relaxed consistency models include processor consistency, weak ordering, and release consistency.&lt;br /&gt;
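&lt;br /&gt;
Lamport's definition can be made concrete by enumerating every interleaving that keeps each processor's program order. The two-processor program below (each processor stores to one variable and then loads the other) is a hypothetical example chosen for illustration and is not taken from the cited sources; the names P0, P1 and sc_outcomes are likewise assumptions.&lt;br /&gt;

```python
from itertools import combinations

# Each operation is (kind, variable, destination register).
P0 = [("store", "x", None), ("load", "y", "r0")]
P1 = [("store", "y", None), ("load", "x", "r1")]


def sc_outcomes(p0, p1):
    """Return every (r0, r1) result allowed under sequential consistency."""
    outcomes = set()
    total = len(p0) + len(p1)
    # Choosing which global time slots belong to p0 fixes one interleaving
    # in which both programs keep their own program order.
    for slots in combinations(range(total), len(p0)):
        memory = {"x": 0, "y": 0}
        regs = {}
        i0 = 0
        i1 = 0
        for step in range(total):
            if step in slots:
                kind, var, reg = p0[i0]
                i0 += 1
            else:
                kind, var, reg = p1[i1]
                i1 += 1
            if kind == "store":
                memory[var] = 1
            else:
                regs[reg] = memory[var]
        outcomes.add((regs["r0"], regs["r1"]))
    return outcomes


allowed = sc_outcomes(P0, P1)
print(sorted(allowed))
```

The result (0, 0) never appears under sequential consistency. A relaxed model that lets a load complete before an earlier store, such as the processor consistency model described in this article, may additionally allow it.&lt;br /&gt;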
&lt;br /&gt;
''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all earlier stores complete before a load takes place, i.e. a younger load can complete before the store preceding it has completed.&lt;br /&gt;
&lt;br /&gt;
''Weak ordering (WO)'' - Weak ordering groups all read and write instructions into critical sections.  Reordering the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. The critical sections are separated by synchronization events, which can be implemented in the form of locks, barriers or post-wait pairs. The requirements for the weak ordering model to work are therefore:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly. &lt;br /&gt;
* All read and write operations before a synchronization event have been completed. &lt;br /&gt;
* All loads and stores following a critical section cannot precede the section.&lt;br /&gt;
&lt;br /&gt;
''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, while a thread leaves the critical section after a release operation. Hence an acquire operation is not concerned with older read/write operations, as that requirement is taken care of by the release operation; similarly, a release operation is not concerned with younger reads/writes, as those are handled by the acquire operation. The requirements for weak ordering can therefore be rewritten for the release consistency model as follows:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.&lt;br /&gt;
* All reads and writes before a release operation should be completed.&lt;br /&gt;
* All acquire operations related to a critical section should be completed before handling a younger read or write.&lt;br /&gt;
* The acquire and release operations should be atomic with respect to each other. &lt;br /&gt;
A core difference between weak ordering and release consistency is that in release consistency overlapping of critical sections is possible.&lt;br /&gt;
Figure 3.1 shows how a sequence of instructions will be executed following a particular consistency model.  As compared to sequential consistency, relaxed consistency models provide faster execution. Now if we can prefetch instructions while conforming to the consistency model, then that will result in an even higher speed up.&lt;br /&gt;
[[File:Figure 3.1.png|thumb|center|upright|350px|Figure 3.1. Memory Consistency Models]]&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching under consistency models'''&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
Prefetching can also be classified as binding and non-binding. In binding prefetching, the value of the prefetched data is locked the instant it is prefetched and is released only when the process reads it later via a regular read. Other processes cannot modify the data during this period.  In non-binding prefetching, other processors can modify the data until the actual regular read is done. An advantage of non-binding prefetching is that it does not affect the correctness of any consistency model.&lt;br /&gt;
Let’s see the improvement in execution time using prefetching by considering the sets of instructions given in examples 4.1 and 4.2. For both sets, the cache hit latency is 1 cycle while the cache miss latency is 100 cycles. An invalidation-based cache coherence scheme is used in both examples.&lt;br /&gt;
&lt;br /&gt;
Example 4.1:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  write A	(miss)&lt;br /&gt;
  write B	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
&lt;br /&gt;
Example 4.2:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  read C		(miss)&lt;br /&gt;
  read D		(hit)&lt;br /&gt;
  read E[D]	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
In sequential consistency (SC), reordering of operations is not allowed. Hence for example 4.1, 3 cache misses and a cache hit take a total of 301 cycles to execute. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock. Hence under RC, the total time for execution is 202 cycles as the second write is a cache hit.  Prefetching improves the performance of systems following either SC or RC. Using prefetching, we assume that lock acquisition is bound to succeed and hence prefetch both the writes. When lock acquisition actually succeeds, we already have data for both the writes prefetched in our cache. Hence all writes incur a cache hit and the total number of cycles required to execute the set will be reduced to 103 cycles for both consistency models.&lt;br /&gt;
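&lt;br /&gt;
The cycle counts quoted for example 4.1 can be checked with simple arithmetic. The snippet below only restates the latencies given in the text; treating a pipelined or prefetched write as a 1-cycle access is the simplification the text itself uses, and the variable names are illustrative.&lt;br /&gt;

```python
HIT, MISS = 1, 100   # latencies in cycles, as given in the text

# Effective latencies of: lock L, write A, write B, unlock L
sc_cycles = MISS + MISS + MISS + HIT         # SC: every access is serialized
rc_cycles = MISS + MISS + HIT + HIT          # RC: write B is pipelined behind write A
prefetch_cycles = MISS + HIT + HIT + HIT     # both writes prefetched during the lock miss

print(sc_cycles, rc_cycles, prefetch_cycles)
```

This yields 301, 202 and 103 cycles, matching the figures above.&lt;br /&gt;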
&lt;br /&gt;
A case where prefetching yields less improvement in execution time is given in example 4.2. The read instructions contain a data dependency between the third and second reads. Under SC, the instructions take 302 cycles to perform. Under RC, they take 203 cycles, as reading the value of E is effectively a cache hit thanks to pipelining within the critical section. With prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. This smaller improvement is attributed to the data dependency between D and E (under SC) and between lock acquisition and D (under RC).  Hence prefetching fails to hide the full latency in such cases.&lt;br /&gt;
&lt;br /&gt;
=='''Disadvantages of Prefetching'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
* Increased complexity and overhead of handling the prefetching algorithms: performance must be improved significantly to counterbalance this overhead and complexity, or the efforts are not worthwhile.&lt;br /&gt;
&lt;br /&gt;
* With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory to not only deal with regular prefetch requests but also to handle prefetch from different sources, thus greatly increasing the overhead and complexity of logic. &lt;br /&gt;
&lt;br /&gt;
* If prefetched data is stored in the data cache, then cache conflicts, or cache pollution, can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=='''Conclusion'''==&lt;br /&gt;
The article dealt with the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and of how prefetching can affect execution time across different consistency models. Overall, prefetching can genuinely help in lowering execution time if out-of-order execution does not take place. However, it cannot help in overlapping memory accesses; the processor will get the loads in the order determined in the program. The only difference is that it will get the data from the cache rather than from main memory, because the data has been prefetched.&lt;br /&gt;
&lt;br /&gt;
=='''Quiz'''==&lt;br /&gt;
1.	In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?&lt;br /&gt;
&lt;br /&gt;
#	Yes in both models&lt;br /&gt;
#	Yes in SC, No in RC&lt;br /&gt;
#	No in SC, Yes in RC&lt;br /&gt;
#	No in both SC and RC.&lt;br /&gt;
&lt;br /&gt;
2.	___________ model categorizes the synchronization operation into Acquire and Release.&lt;br /&gt;
&lt;br /&gt;
#	Sequential Consistency&lt;br /&gt;
#	Release Consistency&lt;br /&gt;
#	Weak Ordering &lt;br /&gt;
#	Processor Consistency.&lt;br /&gt;
&lt;br /&gt;
3.	Which of these is &amp;lt;i&amp;gt;not &amp;lt;/i&amp;gt;a type of hardware implementation of a prefetcher?&lt;br /&gt;
&lt;br /&gt;
#	Predication&lt;br /&gt;
#	Stride&lt;br /&gt;
#	Stream&lt;br /&gt;
#	Adjacent Cache line prefetcher&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74641</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74641"/>
		<updated>2013-04-03T18:36:56Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Disadvantages of Prefetchinghttp://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]&lt;br /&gt;
&lt;br /&gt;
'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
Previous articles&lt;br /&gt;
&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache was fully-associative and had LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency.The number of compulsory and capacity misses can be reduced by employing prefetching techniques such as by increasing the size of the cache lines or by prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive.A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch.If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program and tries to predict what data the program will access next and prefetches that data. There are few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
1. A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
3. An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
* Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
* Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The characteristics of the hardware prefetching are as follows :&lt;br /&gt;
&lt;br /&gt;
* Works with existing applications&lt;br /&gt;
&lt;br /&gt;
* Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
* Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Hardware-based techniques use previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
1).''always prefetch'' - prefetch the next block on each reference &lt;br /&gt;
&lt;br /&gt;
2).''prefetch on miss'' - prefetch the next block only on a miss &lt;br /&gt;
&lt;br /&gt;
3).''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns. &lt;br /&gt;
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching Improvements'''==&lt;br /&gt;
&lt;br /&gt;
Fortunately a number of techniques have evolved to address many of the concerns about prefetching, thus making it significantly more effective.&lt;br /&gt;
&lt;br /&gt;
==='''Cache Sizing'''===&lt;br /&gt;
&lt;br /&gt;
Increasing the dimensions of the cache greatly improves the performance of prefetching.  This can be done either by increasing the cache size itself or by increasing set associativity.  A small cache is a significant problem for prefetching because conflict misses increase dramatically: any benefit derived from having fewer cache misses is negated by the higher incidence of conflict misses.  One study assembled benchmarks for a SPARC (See [[#Definitions]]) processor that utilized prefetching.  The graph below shows the number of cache misses for various cache sizes.  With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, while the number of misses without prefetching is rather high.&lt;br /&gt;
&lt;br /&gt;
For example, the following study of a BioInformatics application showed the effects of prefetching cache misses based on the size of the L1 cache.  As can be seen, cache size has a major impact on the miss rate and, subsequently, performance in the system.&amp;lt;ref&amp;gt;http://www.nikolaylaptev.com/master/classes/cs254.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Cachesize.jpg]]&lt;br /&gt;
&lt;br /&gt;
==='''Improved Prefetch Timing'''===&lt;br /&gt;
&lt;br /&gt;
Improving prefetch timing algorithms can also significantly improve performance and prevent data from being prefetched too late or too early to be useful.  One possibility in this case is to implement a technique called Aggressive Prefetching&amp;lt;ref&amp;gt;http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html&amp;lt;/ref&amp;gt;.  Traditionally, prefetch algorithms have brought in only incremental amounts of data with high levels of confidence.  The more aggressive approach, however, searches more deeply in the reference stream to bring in a wider array of data that may be used in the near future.  This does, however, require a more resource-rich machine to avoid negatively impacting performance.&lt;br /&gt;
&lt;br /&gt;
Greater amounts of processing power and storage have made this more aggressive approach possible.  While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now.  Varying performance across a variety of devices also reduces the need to be conservative in prefetching.  Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.&lt;br /&gt;
&lt;br /&gt;
==='''Memory Pattern Recognition'''===&lt;br /&gt;
&lt;br /&gt;
Effective memory pattern recognition algorithms can go a long way towards resolving issues with prefetching values unnecessarily (i.e., ones that might not be used within a useful timeframe).  One example is the stride technique.  Memory addresses that are sequential in nature, such as those of an array, would all be brought in together as a unit, since most or all of these elements will likely be utilized.  This is an improvement over a more conservative prefetcher that may just assume memory located together will be used.  The more conservative version has been known to have a much higher incidence of bringing in unnecessary blocks of data.&lt;br /&gt;
&lt;br /&gt;
Another variation of this is the linked memory reference pattern.  Linked lists(See [[#Definitions]]) and tree structures are most often associated with this type.  Generally, occurrences of code similar to ptr = ptr-&amp;gt;next are considered as one structure and subsequently brought into memory together.&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, instead of:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
the linked memory reference pattern may work in the following way:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   prefetch(next(p))&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Here, the node after p is requested ahead of time, while work(p) is still executing, so it is already closer to the processor when the loop advances to it.  This is more efficient than the initial loop.&lt;br /&gt;
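The benefit of the second loop can be illustrated with a toy cache model.  The sketch below is only an assumption-laden illustration: the function count_misses, the dictionary next_of, and the unbounded cache are hypothetical, and the prefetch is assumed to complete while work(p) runs.&lt;br /&gt;

```python
def count_misses(next_of, head, use_prefetch):
    """Walk a linked list and count demand misses in a toy, unbounded cache."""
    cache = set()            # ids of nodes currently "near the processor"
    misses = 0
    p = head
    while p is not None:
        if p not in cache:   # demand access to p itself
            misses += 1
            cache.add(p)
        if use_prefetch and next_of[p] is not None:
            cache.add(next_of[p])   # prefetch(next(p)), assumed to finish during work(p)
        # work(p) would run here
        p = next_of[p]       # p = next(p)
    return misses

# Hypothetical five-node list: 0, 1, 2, 3, 4 in link order.
next_of = {0: 1, 1: 2, 2: 3, 3: 4, 4: None}
without_prefetch = count_misses(next_of, 0, False)
with_prefetch = count_misses(next_of, 0, True)
```

In this model the plain loop misses on all 5 nodes, while the prefetching loop misses only on the head node.  A real list would gain less if work(p) is too short to cover the memory latency.&lt;br /&gt;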
&lt;br /&gt;
==='''Markov Prefetcher'''===&lt;br /&gt;
&lt;br /&gt;
This particular prefetch technique&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt; is built around the idea of remembering past sequences of cache misses and using this history to bring in the subsequent misses of a sequence.  Generally a rather large table containing previous miss addresses is used.  This table is maintained in a similar manner to a cache.  Sequences stored in this table are then used to predict future misses, which are prefetched.&lt;br /&gt;
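A minimal sketch of the idea, assuming a first-order predictor (each miss address maps to counts of the misses that followed it).  The class name MarkovPrefetcher, the degree parameter, and the unbounded dictionary are illustrative assumptions; a hardware table would be bounded and replaced like a cache.&lt;br /&gt;

```python
from collections import Counter, defaultdict

class MarkovPrefetcher:
    """Toy first-order Markov predictor over miss addresses (illustrative only)."""

    def __init__(self, degree=2):
        self.degree = degree                    # predictions issued per miss
        self.successors = defaultdict(Counter)  # miss address: counts of following misses
        self.last_miss = None

    def on_miss(self, addr):
        """Record the transition from the previous miss, then return prefetch candidates."""
        if self.last_miss is not None:
            self.successors[self.last_miss][addr] += 1
        self.last_miss = addr
        # Most frequently observed successors of this miss address come first.
        return [a for a, _ in self.successors[addr].most_common(self.degree)]

prefetcher = MarkovPrefetcher(degree=1)
predictions = [prefetcher.on_miss(a) for a in (16, 32, 48, 16)]
```

The first three misses produce no prediction because no history exists yet.  When address 16 misses again, the table suggests prefetching 32, the miss that followed it last time.&lt;br /&gt;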
&lt;br /&gt;
==='''Stealth Prefetching'''===&lt;br /&gt;
&lt;br /&gt;
A more modern and innovative technique, stealth prefetching&amp;lt;ref&amp;gt;http://arnetminer.org/publication/stealth-prefetching-53889.html&amp;lt;/ref&amp;gt;, attempts to deal with issues surrounding the increase of interconnection bandwidth and increased memory latency.  It also helps to avoid the problem of using bandwidth to prematurely access shared data that will later result in state downgrades.  This technique tracks memory at the coarse granularity of regions to identify which regions are heavily used by multiple processors and which are not.  Lines that are not being shared by multiple processors are more aggressively brought nearer to the local processor, thus improving performance by as much as 20%.  With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=='''Memory Consistency models'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
A basic requirement for the implementation of a shared-memory multiprocessor system is to correctly order the reads and writes to any memory location as seen by any processor accessing it. This requirement is called the memory consistency requirement, and memory consistency models specify how it is met. The models are implemented at the machine code interface as well as at the high-level language interface, and they must be respected when translating high-level code into the machine code that is directly executed. The consistency models therefore make memory operations predictable. &lt;br /&gt;
&lt;br /&gt;
==='''An overview of memory consistency models'''===&lt;br /&gt;
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:&lt;br /&gt;
&lt;br /&gt;
“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”&lt;br /&gt;
&lt;br /&gt;
This model prohibits the reordering of any load or store instructions and thus restricts compiler optimization techniques.  Memory consistency models that allow the reordering of operations are called relaxed consistency models. Because reads and writes can be reordered, these models deviate from the programmer’s expectation. Relaxed consistency models include processor consistency, weak ordering, and release consistency.&lt;br /&gt;
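Lamport's definition can be checked by brute force on a tiny program.  The sketch below is an illustration under stated assumptions: two hypothetical threads each store 1 to one shared variable and then load the other (names x, y, T0, T1 are invented here), and "relaxed" is modelled crudely by letting each load pass the store before it.&lt;br /&gt;

```python
# Each operation is (thread, kind, variable); list order is program order.
T0 = [(0, "store", "x"), (0, "load", "y")]
T1 = [(1, "store", "y"), (1, "load", "x")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Execute one global order against a single memory; return (r0, r1)."""
    memory = {"x": 0, "y": 0}
    loaded = {}
    for thread, kind, var in schedule:
        if kind == "store":
            memory[var] = 1
        else:
            loaded[thread] = memory[var]
    return (loaded[0], loaded[1])

# Sequential consistency: every schedule respects both program orders.
sc_outcomes = {run(s) for s in interleavings(T0, T1)}
# Crude relaxed model: additionally allow each load to pass its earlier store.
relaxed_outcomes = sc_outcomes | {run(s) for s in interleavings(T0[::-1], T1[::-1])}
```

Under sequential consistency the loads can return (0, 1), (1, 0) or (1, 1), but never (0, 0).  Once loads may bypass earlier stores, (0, 0) becomes possible, which is the kind of outcome that deviates from the programmer’s expectation.&lt;br /&gt;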
&lt;br /&gt;
''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all earlier stores complete before a load takes place, i.e. a younger load can complete before an older store that precedes it has completed. &lt;br /&gt;
&lt;br /&gt;
''Weak ordering (WO)'' – Weak ordering separates all read and write instructions into critical sections.  Reordering the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. These critical sections are separated by synchronization events, which can be implemented in the form of locks, barriers or post-wait pairs. The requirements for the weak ordering model to work are therefore: &lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly. &lt;br /&gt;
* All read and write operations before a synchronization event have been completed. &lt;br /&gt;
* All loads and stores following a critical section cannot precede the section.&lt;br /&gt;
&lt;br /&gt;
''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, while a thread leaves the critical section after a release operation. Hence an acquire operation is not concerned with older read/write operations, as that requirement is taken care of by the release operation, and similarly a release operation is not concerned with younger reads/writes, as those are handled by the acquire operation. Hence we can rewrite the requirements for weak ordering to follow the release consistency model as follows:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.&lt;br /&gt;
* All reads and writes before a release operation should be completed.&lt;br /&gt;
* All acquire operations related to a critical section should be completed before handling a younger read or write. &lt;br /&gt;
* The acquire and release operations should be atomic with respect to each other. &lt;br /&gt;
A core difference between weak ordering and release consistency is that in release consistency overlapping of critical sections is possible.&lt;br /&gt;
Figure 3.1 shows how a sequence of instructions will be executed following a particular consistency model.  As compared to sequential consistency, relaxed consistency models provide faster execution. Now if we can prefetch instructions while conforming to the consistency model, then that will result in an even higher speed up.&lt;br /&gt;
[[File:Figure 3.1.png|thumb|center|upright|350px|Figure 3.1. Memory Consistency Models]]&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching under consistency models'''&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
Prefetching can also be classified as binding or non-binding. In binding prefetching, the value of the prefetched data is locked the instant it is prefetched and is released only when the process reads it later via a regular read. Other processors cannot modify the data during this period.  In non-binding prefetching, other processors can modify the data until the actual regular read is done. An advantage of non-binding prefetching is that it does not affect the correctness of any consistency model.&lt;br /&gt;
Let’s see the improvement in execution time from prefetching by considering the instruction sequences given in examples 4.1 and 4.2. For both, the cache hit latency is 1 cycle and the cache miss latency is 100 cycles. An invalidation-based cache coherence scheme is used in both examples.  &lt;br /&gt;
&lt;br /&gt;
Example 4.1:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  write A	(miss)&lt;br /&gt;
  write B	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
&lt;br /&gt;
Example 4.2:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  read C		(miss)&lt;br /&gt;
  read D		(hit)&lt;br /&gt;
  read E[D]	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
In sequential consistency (SC), reordering of operations is not allowed. Hence for example 4.1, 3 cache misses and a cache hit take a total of 301 cycles to execute. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock. Hence under RC, the total time for execution is 202 cycles as the second write is a cache hit.  Prefetching improves the performance of systems following either SC or RC. Using prefetching, we assume that lock acquisition is bound to succeed and hence prefetch both the writes. When lock acquisition actually succeeds, we already have data for both the writes prefetched in our cache. Hence all writes incur a cache hit and the total number of cycles required to execute the set will be reduced to 103 cycles for both consistency models.&lt;br /&gt;
&lt;br /&gt;
A case where prefetching helps far less is given in example 4.2. The read instructions have a data dependency between the second and third reads: the address of E depends on the value of D. Under SC, the instructions take 302 cycles to perform. Under RC, they take 203 cycles as reading the value of E is a cache hit using pipelining within the critical section. With prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. The smaller improvement is attributed to the data dependency between D and E (under SC) and between lock acquisition and D (under RC).  Hence prefetching cannot hide all of the miss latency in such cases.&lt;br /&gt;
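The cycle counts that follow from simple addition can be reproduced with a short calculation.  This is a sketch under the stated latencies only (1 cycle per hit, 100 per miss); the helper names are hypothetical, and the pipelined RC figures of 202 and 203 cycles are not modelled here.&lt;br /&gt;

```python
HIT, MISS = 1, 100  # cycle costs stated in the text

example_4_1 = [("lock L", MISS), ("write A", MISS), ("write B", MISS), ("unlock L", HIT)]
example_4_2 = [("lock L", MISS), ("read C", MISS), ("read D", HIT),
               ("read E[D]", MISS), ("unlock L", HIT)]

def sc_cycles(ops):
    """Sequential consistency without prefetching: latencies simply add up."""
    return sum(cost for _, cost in ops)

def cycles_with_prefetch(ops, prefetched):
    """Accesses named in prefetched become hits; all others keep their cost."""
    return sum(HIT if name in prefetched else cost for name, cost in ops)

sc_4_1 = sc_cycles(example_4_1)
pf_4_1 = cycles_with_prefetch(example_4_1, {"write A", "write B"})
sc_4_2 = sc_cycles(example_4_2)
# Assumption: E[D] cannot be prefetched early because its address depends on D.
pf_sc_4_2 = cycles_with_prefetch(example_4_2, {"read C"})
```

The four results are 301, 103, 302 and 203 cycles, matching the SC figures and the prefetching figures quoted above for example 4.1 and for example 4.2 under SC.&lt;br /&gt;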
&lt;br /&gt;
=='''Disadvantages of Prefetching'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
* Increased complexity and overhead of handling the prefetching algorithms: performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.&lt;br /&gt;
&lt;br /&gt;
* With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory to not only deal with regular prefetch requests but also to handle prefetch from different sources, thus greatly increasing the overhead and complexity of logic. &lt;br /&gt;
&lt;br /&gt;
* If prefetched data is stored in the data cache, then cache conflicts, or cache pollution, can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=='''Conclusion'''==&lt;br /&gt;
The article dealt with the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and of how prefetching can affect execution time under different consistency models. Overall, prefetching can genuinely help in lowering execution time if out-of-order execution does not take place. However, it cannot help in overlapping memory accesses; the processor will get the loads in the order determined by the program. The only difference is that it will get the data from the cache rather than from main memory because it has been prefetched.&lt;br /&gt;
&lt;br /&gt;
=='''Quiz'''==&lt;br /&gt;
1.	In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?&lt;br /&gt;
&lt;br /&gt;
#	Yes in both models&lt;br /&gt;
#	Yes in SC, No in RC&lt;br /&gt;
#	No in SC, Yes in RC&lt;br /&gt;
#	No in both SC and RC.&lt;br /&gt;
&lt;br /&gt;
2.	___________ model categorizes the synchronization operation into Acquire and Release.&lt;br /&gt;
&lt;br /&gt;
#	Sequential Consistency&lt;br /&gt;
#	Release Consistency&lt;br /&gt;
#	Weak Ordering &lt;br /&gt;
#	Processor Consistency.&lt;br /&gt;
&lt;br /&gt;
3.	Which of these is &amp;lt;i&amp;gt;not &amp;lt;/i&amp;gt;a type of hardware implementation of a prefetcher?&lt;br /&gt;
&lt;br /&gt;
#	Predication&lt;br /&gt;
#	Stride&lt;br /&gt;
#	Stream&lt;br /&gt;
#	Adjacent Cache line prefetcher&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74640</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74640"/>
		<updated>2013-04-03T18:31:29Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[https://docs.google.com/a/ncsu.edu/document/d/18ap34rWJImZjMAtGbANK9lZMon4NgwqDb5UY9Z8ksLM Link to write up]&lt;br /&gt;
&lt;br /&gt;
'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
Previous articles&lt;br /&gt;
&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr Article 1]&lt;br /&gt;
# [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_jp Article 2]&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative and had LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques, such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
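The timing trade-off described here is often reduced to a rule of thumb for loops: prefetch far enough ahead that the memory latency is covered by the work of the intervening iterations.  The sketch below states that rule as an assumption; the function name and the example numbers are hypothetical and not taken from the cited manual.&lt;br /&gt;

```python
import math

def prefetch_distance(miss_latency_cycles, cycles_per_iteration):
    """Iterations ahead to prefetch so that data arrives just before it is used."""
    return math.ceil(miss_latency_cycles / cycles_per_iteration)

# Hypothetical numbers: a 100-cycle miss and a loop body of 30 cycles.
distance = prefetch_distance(100, 30)
```

With these numbers the prefetch should be issued 4 iterations ahead.  A much larger distance risks the eviction problem described above, and a smaller one leaves part of the latency exposed.&lt;br /&gt;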
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
1. A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
2. A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
3. An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
* Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
* Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
* Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The characteristics of the hardware prefetching are as follows :&lt;br /&gt;
&lt;br /&gt;
* Works with existing applications&lt;br /&gt;
&lt;br /&gt;
* Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
* Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
* Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Based on the previous memory access, the hardware based techniques can be employed to dynamically predict the memory address to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip Schemes: Based on the addresses required by the processor in all data references.  &lt;br /&gt;
* Off-chip Schemes: Based on the addresses that result in L1 cache misses &lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
1).''always prefetch'' - prefetch the next block on each reference &lt;br /&gt;
&lt;br /&gt;
2).''prefetch on miss'' - prefetch the next block only on a miss &lt;br /&gt;
&lt;br /&gt;
3).''tagged prefetch'' - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
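The three OBL variants can be compared on a purely sequential reference stream.  The following is a simplified sketch: the function obl_misses, the policy strings, and the unbounded cache are assumptions made for illustration, and prefetches are assumed to arrive before the next reference.&lt;br /&gt;

```python
def obl_misses(refs, policy):
    """Count demand misses for a block reference stream under one OBL policy."""
    cache = set()    # blocks currently cached (unbounded, for simplicity)
    tagged = set()   # blocks brought in by prefetch and not yet referenced
    misses = 0
    for b in refs:
        miss = b not in cache
        first_use_of_prefetched = b in tagged
        tagged.discard(b)
        if miss:
            misses += 1
            cache.add(b)
        if policy == "always":
            trigger = True
        elif policy == "on_miss":
            trigger = miss
        elif policy == "tagged":
            # First access means a demand miss or the first touch of a prefetched block.
            trigger = miss or first_use_of_prefetched
        else:
            trigger = False          # any other string: no prefetching
        if trigger and (b + 1) not in cache:
            cache.add(b + 1)
            tagged.add(b + 1)
    return misses

refs = list(range(8))   # blocks 0 to 7 referenced once each, in order
results = [obl_misses(refs, p) for p in ("none", "always", "on_miss", "tagged")]
```

On this stream the counts are 8 misses without prefetching, 1 with always prefetch, 4 with prefetch on miss, and 1 with tagged prefetch.  Prefetch on miss only removes every other miss because a hit on a prefetched block does not trigger the next prefetch.&lt;br /&gt;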
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
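A reference prediction table indexed by the program counter can be sketched as follows.  This is a toy model under assumptions: the class StridePrefetcher, the two-field table entry, and the rule "predict once the same non-zero stride is seen twice" are illustrative choices, not a description of any particular processor.&lt;br /&gt;

```python
class StridePrefetcher:
    """Toy reference prediction table (RPT) keyed by program counter."""

    def __init__(self):
        self.table = {}   # pc: (last address, last stride)

    def access(self, pc, addr):
        """Return an address to prefetch, or None if no stable stride is seen yet."""
        prediction = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            # Predict only when the same non-zero stride repeats.
            if stride != 0 and stride == last_stride:
                prediction = addr + stride
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prediction

rpt = StridePrefetcher()
# Hypothetical load instruction at pc 0x400 walking memory in steps of 64.
suggested = [rpt.access(0x400, a) for a in (100, 164, 228, 292)]
```

The first two accesses only train the table.  From the third access onward the stride of 64 is confirmed, so the sketch suggests prefetching 292 and then 356.&lt;br /&gt;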
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns. &lt;br /&gt;
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching Improvements'''==&lt;br /&gt;
&lt;br /&gt;
Fortunately a number of techniques have evolved to address many of the concerns about prefetching, thus making it significantly more effective.&lt;br /&gt;
&lt;br /&gt;
==='''Cache Sizing'''===&lt;br /&gt;
&lt;br /&gt;
Increasing the dimensions of the cache greatly improves the performance of prefetching.  This can be done either by increasing the cache size itself or by increasing set associativity.  A small cache is a significant problem for prefetching because conflict misses increase dramatically.  Any benefit derived from having fewer cache lookup misses is negated by this higher incidence of conflict misses.  The following study assembled some benchmarks for a given SPARC (see [[#Definitions]]) processor that utilized prefetching.  The following graph shows the number of cache misses for various cache sizes.  With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, while the number of misses without prefetching is rather high.&lt;br /&gt;
&lt;br /&gt;
For example, the following study of a bioinformatics application showed the effect of prefetching on cache misses as a function of the size of the L1 cache.  As can be seen, cache size has a major impact on the miss rate and, subsequently, on the performance of the system.&amp;lt;ref&amp;gt;http://www.nikolaylaptev.com/master/classes/cs254.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Cachesize.jpg]]&lt;br /&gt;
&lt;br /&gt;
==='''Improved Prefetch Timing'''===&lt;br /&gt;
&lt;br /&gt;
Improving prefetch timing algorithms can also significantly improve performance by preventing data from being prefetched too late or too early to be useful.  One possibility is to implement a technique called Aggressive Prefetching&amp;lt;ref&amp;gt;http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html&amp;lt;/ref&amp;gt;.  Traditionally, prefetch algorithms have fetched only incremental amounts of data, and only with high levels of confidence.  The more aggressive approach, however, searches more deeply in the reference stream to bring in a wider array of data that may be used in the near future.  This does, however, require a machine with more resources to avoid negatively impacting performance.&lt;br /&gt;
&lt;br /&gt;
Greater amounts of processing power and storage have made this more aggressive approach possible.  While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now.  Varying performance across a variety of devices also changes how conservative prefetching needs to be.  Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.&lt;br /&gt;
&lt;br /&gt;
==='''Memory Pattern Recognition'''===&lt;br /&gt;
&lt;br /&gt;
Effective memory pattern recognition algorithms can go a long way towards avoiding unnecessary prefetches (i.e., of values that will not be used soon enough).  One example is the stride technique.  Memory addresses that are sequential in nature, like those of an array, are brought in together as a unit, since most or all of these elements are likely to be used.  This is an improvement over a more conservative prefetcher that may just assume memory located together will be used.  The more conservative version has been known to have a much higher incidence of bringing in unnecessary blocks of data.&lt;br /&gt;
&lt;br /&gt;
Another variation of this is the linked memory reference pattern.  Linked lists (see [[#Definitions]]) and tree structures are most often associated with this type.  Generally, nodes reached through code similar to ptr = ptr-&amp;gt;next are treated as one structure and brought into memory together.&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, instead of:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
the linked memory reference pattern may work in the following way:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   prefetch(next(p))&lt;br /&gt;
   work(p)&lt;br /&gt;
   p = next(p)&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Here, the node after p is requested ahead of time, while work(p) is still executing, so it is already closer to the processor when the loop advances to it.  This is more efficient than the initial loop.&lt;br /&gt;
&lt;br /&gt;
==='''Markov Prefetcher'''===&lt;br /&gt;
&lt;br /&gt;
This particular prefetch technique&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt; is built around the idea of remembering past sequences of cache misses and using this history to bring in the subsequent misses of a sequence.  Generally a rather large table containing previous miss addresses is used.  This table is maintained in a similar manner to a cache.  Sequences stored in this table are then used to predict future misses, which are prefetched.&lt;br /&gt;
&lt;br /&gt;
==='''Stealth Prefetching'''===&lt;br /&gt;
&lt;br /&gt;
A more modern and innovative technique, stealth prefetching&amp;lt;ref&amp;gt;http://arnetminer.org/publication/stealth-prefetching-53889.html&amp;lt;/ref&amp;gt;, attempts to deal with issues surrounding the increase of interconnection bandwidth and increased memory latency.  It also helps to avoid the problem of using bandwidth to prematurely access shared data that will later result in state downgrades.  This technique tracks memory at the coarse granularity of regions to identify which regions are heavily used by multiple processors and which are not.  Lines that are not being shared by multiple processors are more aggressively brought nearer to the local processor, thus improving performance by as much as 20%.  With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=='''Memory Consistency models'''&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/consistency1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
A basic requirement for the implementation of a shared-memory multiprocessor system is to correctly order the reads and writes to any memory location as seen by any processor accessing it. This requirement is called the memory consistency requirement, and memory consistency models specify how it is met. The models are implemented at the machine code interface as well as at the high-level language interface, and they must be respected when translating high-level code into the machine code that is directly executed. The consistency models therefore make memory operations predictable. &lt;br /&gt;
&lt;br /&gt;
==='''An overview of memory consistency models'''===&lt;br /&gt;
We can classify consistency models as those conforming to the programmer’s intuition and those which do not. The programmer’s implicit expectation in the ordering of memory accesses is that the memory accesses should follow the order given in the program and they should occur instantaneously, i.e. atomically. The expectations are completely captured by the [http://en.wikipedia.org/wiki/Sequential_consistency ''Sequential Consistency''] model, which was defined by [http://en.wikipedia.org/wiki/Leslie_Lamport Leslie Lamport] as follows:&lt;br /&gt;
&lt;br /&gt;
“''A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.''”&lt;br /&gt;
&lt;br /&gt;
This model prohibits the reordering of any load or store instructions and thus restricts compiler optimization techniques.  Memory consistency models that allow the reordering of operations are called relaxed consistency models. Because reads and writes can be reordered, these models deviate from the programmer’s expectation. Relaxed consistency models include processor consistency, weak ordering, and release consistency.&lt;br /&gt;
&lt;br /&gt;
''Processor consistency model (PC)'' - The processor consistency model relaxes the requirement that all earlier stores complete before a load takes place, i.e. a younger load can complete before an older store that precedes it has completed. &lt;br /&gt;
&lt;br /&gt;
''Weak ordering (WO)'' – Weak ordering separates all read and write instructions into critical sections.  Reordering the critical sections themselves is not allowed by this model, but reordering of instructions within a section is allowed. These critical sections are separated by synchronization events, which can be implemented in the form of locks, barriers or post-wait pairs. The requirements for the weak ordering model to work are therefore: &lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly. &lt;br /&gt;
* All read and write operations before a synchronization event have been completed. &lt;br /&gt;
* All loads and stores following a critical section cannot precede the section.&lt;br /&gt;
&lt;br /&gt;
''Release consistency model'' - This model classifies synchronization events as acquire and release operations. An acquire operation allows a thread to enter a critical section, while a thread leaves the critical section after a release operation. Hence an acquire operation is not concerned with older read/write operations, as that requirement is taken care of by the release operation, and similarly a release operation is not concerned with younger reads/writes, as those are handled by the acquire operation. Hence we can rewrite the requirements for weak ordering to follow the release consistency model as follows:&lt;br /&gt;
* The programmer has clearly defined the critical sections and has implemented the synchronization of the critical section explicitly.&lt;br /&gt;
* All reads and writes before a release operation should be completed.&lt;br /&gt;
* All acquire operations related to a critical section should be completed before handling a younger read or write. &lt;br /&gt;
* The acquire and release operations should be atomic with respect to each other. &lt;br /&gt;
A core difference between weak ordering and release consistency is that in release consistency overlapping of critical sections is possible.&lt;br /&gt;
Figure 3.1 shows how a sequence of instructions will be executed following a particular consistency model.  As compared to sequential consistency, relaxed consistency models provide faster execution. Now if we can prefetch instructions while conforming to the consistency model, then that will result in an even higher speed up.&lt;br /&gt;
[[File:Figure 3.1.png|thumb|center|upright|350px|Figure 3.1. Memory Consistency Models]]&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching under consistency models'''&amp;lt;ref&amp;gt;http://parasol.tamu.edu/~rwerger/Courses/654/pref1.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
Prefetching can also be classified as binding or non-binding. In binding prefetching, the value of the prefetched data is locked the instant it is prefetched and is released only when the process reads it later via a regular read. Other processes cannot modify the data during this period. In non-binding prefetching, other processors can modify the data until the actual regular read is done. An advantage of non-binding prefetching is that it does not affect the correctness of any consistency model.&lt;br /&gt;
Let’s look at the improvement in execution time from prefetching by considering the instruction sequences given in examples 4.1 and 4.2. For both sequences, the cache hit latency is 1 cycle and the cache miss latency is 100 cycles. An invalidation-based cache coherence scheme is used in both examples.  &lt;br /&gt;
&lt;br /&gt;
Example 4.1:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  write A	(miss)&lt;br /&gt;
  write B	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
&lt;br /&gt;
Example 4.2:&lt;br /&gt;
  lock L		(miss)&lt;br /&gt;
  read C		(miss)&lt;br /&gt;
  read D		(hit)&lt;br /&gt;
  read E[D]	(miss)&lt;br /&gt;
  unlock L	(hit)&lt;br /&gt;
Under sequential consistency (SC), reordering of operations is not allowed. Hence for example 4.1, three cache misses and one cache hit take a total of 3 × 100 + 1 = 301 cycles to execute. Under relaxed consistency (RC), the writes can be pipelined within the critical region bounded by the lock, so the second write overlaps the first and effectively completes as a cache hit; the total execution time is 100 + 100 + 1 + 1 = 202 cycles. Prefetching improves the performance of systems following either SC or RC. With prefetching, we assume that lock acquisition will succeed and prefetch the data for both writes while the lock is being acquired. When lock acquisition actually succeeds, the data for both writes is already in the cache. Hence both writes incur a cache hit and the total execution time is reduced to 100 + 3 × 1 = 103 cycles under both consistency models.&lt;br /&gt;
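&lt;br /&gt;
The cycle counts for example 4.1 can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not a cycle-accurate simulator: the function name &lt;code&gt;example_4_1_cycles&lt;/code&gt; and the way the writes are accounted for are assumptions made for illustration, using only the 1-cycle hit and 100-cycle miss latencies stated above.&lt;br /&gt;

```python
MISS = 100  # cache miss latency in cycles, as stated in the text
HIT = 1     # cache hit latency in cycles, as stated in the text


def example_4_1_cycles(model, prefetch):
    """Return the cycle count for: lock L, write A, write B, unlock L.

    model is "SC" or "RC"; prefetch is True or False. The accounting is a
    simplified reading of the text (hypothetical helper, for illustration).
    """
    lock = MISS    # acquiring the lock always misses
    unlock = HIT   # releasing the lock always hits
    if prefetch:
        # Both lines were fetched while the lock was being acquired.
        writes = HIT + HIT
    elif model == "RC":
        # The writes are pipelined: the second overlaps the first miss.
        writes = MISS + HIT
    else:
        # SC without prefetching: each write miss is serialized.
        writes = MISS + MISS
    return lock + writes + unlock


for model in ("SC", "RC"):
    for prefetch in (False, True):
        print(model, "prefetch" if prefetch else "no prefetch",
              example_4_1_cycles(model, prefetch))
```

Running the loop prints 301 and 103 cycles for SC and 202 and 103 cycles for RC, matching the totals in the text.&lt;br /&gt;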
&lt;br /&gt;
A case where prefetching provides less benefit is given in example 4.2. The address of the third read, E[D], depends on the value returned by the second read, D. Under SC, the instructions take 302 cycles to execute. Under RC, they take 203 cycles, because pipelining within the critical section makes the read of E[D] effectively a cache hit. With prefetching, the instructions take 203 cycles under SC and 202 cycles under RC. The improvement is limited by the data dependency between D and E (under SC) and between lock acquisition and D (under RC). Hence prefetching alone cannot fully hide the miss latency in such cases.&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of Prefetching&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
* Increased complexity and overhead of handling the prefetching algorithms: performance must be improved significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.&lt;br /&gt;
&lt;br /&gt;
* With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory to not only deal with regular prefetch requests but also to handle prefetch from different sources, thus greatly increasing the overhead and complexity of logic. &lt;br /&gt;
&lt;br /&gt;
* If prefetched data is stored in the data cache, then cache conflicts or cache pollution can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=='''Conclusion'''==&lt;br /&gt;
This article covered the concept of prefetching and its hardware and software implementations. It gave a brief overview of a few of the memory consistency models implemented to date and showed how prefetching affects execution time under different consistency models. Overall, prefetching can genuinely help lower execution time when out-of-order execution does not take place. However, it does not by itself overlap memory accesses; the processor still receives the loads in program order. The only difference is that it gets the data from the cache rather than from main memory, because the data has been prefetched.&lt;br /&gt;
&lt;br /&gt;
=='''Quiz'''==&lt;br /&gt;
1.	In example 4.1, if we reorder the writes within the critical section and as a result write B occurs before write A, will this result in lower execution time in either Sequential consistency (SC) or Relaxed consistency (RC)?&lt;br /&gt;
&lt;br /&gt;
#	Yes in both models&lt;br /&gt;
#	Yes in SC, No in RC&lt;br /&gt;
#	No in SC, Yes in RC&lt;br /&gt;
#	No in both SC and RC.&lt;br /&gt;
&lt;br /&gt;
2.	___________ model categorizes the synchronization operation into Acquire and Release.&lt;br /&gt;
&lt;br /&gt;
#	Sequential Consistency&lt;br /&gt;
#	Release Consistency&lt;br /&gt;
#	Weak Ordering &lt;br /&gt;
#	Processor Consistency.&lt;br /&gt;
&lt;br /&gt;
3.	Which of these is &amp;lt;i&amp;gt;not&amp;lt;/i&amp;gt; a type of hardware implementation of a prefetcher?&lt;br /&gt;
&lt;br /&gt;
#	Predication&lt;br /&gt;
#	Stride&lt;br /&gt;
#	Stream&lt;br /&gt;
#	Adjacent Cache line prefetcher&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74557</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74557"/>
		<updated>2013-04-02T01:43:55Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative and had LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques, such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
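&lt;br /&gt;
The stride detection described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the design of any real processor: the class name &lt;code&gt;StridePrefetcher&lt;/code&gt;, the per-instruction table, and the rule of predicting only after the same stride is seen twice in a row are all hypothetical choices made for illustration.&lt;br /&gt;

```python
class StridePrefetcher:
    """Toy per-instruction stride table (hypothetical, for illustration)."""

    def __init__(self):
        # Maps an instruction address (pc) to (last_address, last_stride).
        self.table = {}

    def access(self, pc, address):
        """Record one access; return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            # First time this instruction is seen: nothing to predict yet.
            self.table[pc] = (address, 0)
            return None
        last_address, last_stride = entry
        stride = address - last_address
        self.table[pc] = (address, stride)
        # Predict only when the same non-zero stride repeats.
        if stride != 0 and stride == last_stride:
            return address + stride
        return None


prefetcher = StridePrefetcher()
predictions = [prefetcher.access(0x400, a) for a in (1000, 1256, 1512, 1768)]
print(predictions)
```

With a constant stride of 256 bytes, the first two accesses only train the table; from the third access onward the sketch predicts the next address (1768, then 2024).&lt;br /&gt;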
&lt;br /&gt;
An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The characteristics of the hardware prefetching are as follows :&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
3. Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Based on the previous memory access, the hardware based techniques can be employed to dynamically predict the memory address to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip Schemes: Based on the addresses required by the processor in all data references.  &lt;br /&gt;
* Off-chip Schemes: Based on the addresses that result in L1 cache misses &lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
(i) always prefetch - prefetch the next block on each reference &lt;br /&gt;
(ii) prefetch on miss - prefetch the next block only on a miss &lt;br /&gt;
(iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
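&lt;br /&gt;
The three OBL variants listed above differ only in when the next block is prefetched. The following is a minimal sketch that counts demand misses for a block-address trace under each policy; the function name &lt;code&gt;simulate_obl&lt;/code&gt; and the assumption of an unbounded cache (no evictions) are simplifications introduced for illustration.&lt;br /&gt;

```python
def simulate_obl(trace, policy):
    """Count demand misses for a block-address trace under one OBL policy.

    policy is "none", "always", "on_miss", or "tagged". The cache is
    assumed unbounded, so only the prefetch trigger differs between runs.
    """
    cache = set()      # blocks currently resident
    untouched = set()  # prefetched blocks not yet referenced (the "tag")
    misses = 0
    for block in trace:
        miss = block not in cache
        # A block is used for the first time on a miss or on the first
        # reference to a block that was brought in by a prefetch.
        first_use = miss or block in untouched
        if miss:
            misses += 1
            cache.add(block)
        untouched.discard(block)
        trigger = (
            policy == "always"
            or (policy == "on_miss" and miss)
            or (policy == "tagged" and first_use)
        )
        if trigger and block + 1 not in cache:
            cache.add(block + 1)
            untouched.add(block + 1)
    return misses


trace = list(range(8))  # purely sequential access to blocks 0..7
for policy in ("none", "on_miss", "tagged", "always"):
    print(policy, simulate_obl(trace, policy))
```

On this sequential trace, prefetch on miss halves the misses (8 down to 4), while tagged prefetch and always prefetch leave only the initial miss, because each first reference to a prefetched block triggers the next prefetch.&lt;br /&gt;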
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch for the workload.&lt;br /&gt;
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns. &lt;br /&gt;
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of Prefetching&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
1. Increased complexity and overhead of handling the prefetching algorithms. Performance must be improved significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.&lt;br /&gt;
&lt;br /&gt;
2. With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory to not only deal with regular prefetch requests but also to handle prefetch from different sources, thus greatly increasing the overhead and complexity of logic. &lt;br /&gt;
&lt;br /&gt;
3. If prefetched data is stored in the data cache, then cache conflicts or cache pollution can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74554</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74554"/>
		<updated>2013-04-02T01:41:11Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Disadvantages of Prefetching */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative and had LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques, such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The characteristics of the hardware prefetching are as follows :&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
3. Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Based on the previous memory access, the hardware based techniques can be employed to dynamically predict the memory address to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip Schemes: Based on the addresses required by the processor in all data references.  &lt;br /&gt;
* Off-chip Schemes: Based on the addresses that result in L1 cache misses &lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
(i) always prefetch - prefetch the next block on each reference &lt;br /&gt;
(ii) prefetch on miss - prefetch the next block only on a miss &lt;br /&gt;
(iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch for the workload.&lt;br /&gt;
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns. &lt;br /&gt;
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes like sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. The data storage system cannot use most hardware-initiated or software-initiated prefetching techniques as the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of Prefetching&amp;lt;ref&amp;gt;http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/10a_dr&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
1. Increased complexity and overhead of handling the prefetching algorithms. Performance must be improved significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.&lt;br /&gt;
&lt;br /&gt;
2. With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory to not only deal with regular prefetch requests but also to handle prefetch from different sources, thus greatly increasing the overhead and complexity of logic. &lt;br /&gt;
&lt;br /&gt;
3. If prefetched data is stored in the data cache, then cache conflicts or cache pollution can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74553</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74553"/>
		<updated>2013-04-02T01:40:19Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Prefetching and Memory Consistency Models */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative and had LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques, such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Must be added to the code explicitly, so it does not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications.&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns.&lt;br /&gt;
&lt;br /&gt;
3. Incurs a start-up penalty before the hardware prefetcher triggers, and issues extra fetches after the end of an array is reached.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Hardware-based techniques use the history of previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques fall into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is sequential readahead. The simplest form is One Block Lookahead (OBL), which prefetches one block beyond the requested block. OBL comes in three variants:&lt;br /&gt;
&lt;br /&gt;
(i) always prefetch - prefetch the next block on each reference &lt;br /&gt;
(ii) prefetch on miss - prefetch the next block only on a miss &lt;br /&gt;
(iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
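&lt;br /&gt;
The three OBL variants listed above can be compared with a small simulation. This is a minimal sketch under simplifying assumptions that are not in the cited sources: the cache is unbounded, prefetched blocks arrive instantly, and the function and policy names are hypothetical.&lt;br /&gt;

```python
def simulate_obl(block_trace, policy):
    """Return the number of demand misses for a trace of block numbers."""
    cache = set()   # blocks currently cached (unbounded, for simplicity)
    tagged = set()  # prefetched blocks that have not been referenced yet
    misses = 0
    for block in block_trace:
        hit = block in cache
        first_use_of_prefetch = block in tagged
        tagged.discard(block)
        if not hit:
            misses += 1
            cache.add(block)
        if policy == "always":
            trigger = True
        elif policy == "on_miss":
            trigger = not hit
        elif policy == "tagged":
            # First reference: a demand miss, or the first hit on a
            # block that was brought in by a prefetch.
            trigger = (not hit) or first_use_of_prefetch
        else:
            raise ValueError("unknown policy: " + policy)
        if trigger and (block + 1) not in cache:
            cache.add(block + 1)
            tagged.add(block + 1)
    return misses


# A purely sequential trace of eight blocks.
results = {p: simulate_obl(range(8), p) for p in ("always", "on_miss", "tagged")}
print(results)
```

On this sequential trace, prefetch on miss still misses on every other block, while always prefetch and tagged prefetch miss only on the first block; tagged prefetch achieves this without issuing a prefetch on every repeated reference.&lt;br /&gt;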
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per-stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, several predicted blocks are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.&lt;br /&gt;
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes such as sequential prefetching, because only sequential prefetching can achieve high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads, so stride detection is not worth implementing there. History-based prefetching suffers from low predictive accuracy and from the cost of the extra reads it adds to an already bottlenecked I/O system. A data storage system cannot use most hardware-initiated or software-initiated prefetching techniques because the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of Prefetching==&lt;br /&gt;
&lt;br /&gt;
1. Increased complexity and overhead of handling the prefetching algorithms. Performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.&lt;br /&gt;
&lt;br /&gt;
2. With multiple cores, prefetch requests can originate from many different cores. This puts additional stress on memory, which must not only deal with regular prefetch requests but also handle prefetches from different sources, greatly increasing the overhead and complexity of the logic.&lt;br /&gt;
&lt;br /&gt;
3. If prefetched data is stored in the data cache, then cache conflicts or cache pollution can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74552</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74552"/>
		<updated>2013-04-02T01:39:28Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Hardware-based Prefetching Techniqueshttp://suif.stanford.edu/papers/mowry92/subsection3_5_2.htmlhttp://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques, such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt;, the programmer or compiler inserts prefetch instructions into the program. These instructions initiate a load of a cache line into the cache but do not stall waiting for the data to arrive. A critical property of a prefetch instruction is the time between executing the prefetch and using the data. If the prefetch is too close to the instruction that uses the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level, and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction that uses the data, the prefetched cache line may already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction but also introduces additional cost, since the cache line is now fetched twice from main memory or the next cache level, which increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. The processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams in which the program accesses a sequence of consecutive cache lines. When such a stream is found, the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with a regular stride, which need not be to consecutive cache lines. When such an instruction is detected, the processor tries to prefetch the cache lines that the instruction will access next.&lt;br /&gt;
&lt;br /&gt;
An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches the cache lines adjacent to the ones being accessed by the program. This can be used to mimic the behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Must be added to the code explicitly, so it does not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications.&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns.&lt;br /&gt;
&lt;br /&gt;
3. Incurs a start-up penalty before the hardware prefetcher triggers, and issues extra fetches after the end of an array is reached.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Hardware-based techniques use the history of previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques fall into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is sequential readahead. The simplest form is One Block Lookahead (OBL), which prefetches one block beyond the requested block. OBL comes in three variants:&lt;br /&gt;
&lt;br /&gt;
(i) always prefetch - prefetch the next block on each reference &lt;br /&gt;
(ii) prefetch on miss - prefetch the next block only on a miss &lt;br /&gt;
(iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per-stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, several predicted blocks are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.&lt;br /&gt;
&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes such as sequential prefetching, because only sequential prefetching can achieve high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads, so stride detection is not worth implementing there. History-based prefetching suffers from low predictive accuracy and from the cost of the extra reads it adds to an already bottlenecked I/O system. A data storage system cannot use most hardware-initiated or software-initiated prefetching techniques because the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of Prefetching==&lt;br /&gt;
&lt;br /&gt;
1. Increased complexity and overhead of handling the prefetching algorithms. Performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.&lt;br /&gt;
&lt;br /&gt;
2. With multiple cores, prefetch requests can originate from many different cores. This puts additional stress on memory, which must not only deal with regular prefetch requests but also handle prefetches from different sources, greatly increasing the overhead and complexity of the logic.&lt;br /&gt;
&lt;br /&gt;
3. If prefetched data is stored in the data cache, then cache conflicts or cache pollution can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74551</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74551"/>
		<updated>2013-04-02T01:38:42Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques, such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt;, the programmer or compiler inserts prefetch instructions into the program. These instructions initiate a load of a cache line into the cache but do not stall waiting for the data to arrive. A critical property of a prefetch instruction is the time between executing the prefetch and using the data. If the prefetch is too close to the instruction that uses the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level, and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction that uses the data, the prefetched cache line may already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction but also introduces additional cost, since the cache line is now fetched twice from main memory or the next cache level, which increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. The processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams in which the program accesses a sequence of consecutive cache lines. When such a stream is found, the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with a regular stride, which need not be to consecutive cache lines. When such an instruction is detected, the processor tries to prefetch the cache lines that the instruction will access next.&lt;br /&gt;
&lt;br /&gt;
An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches the cache lines adjacent to the ones being accessed by the program. This can be used to mimic the behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Must be added to the code explicitly, so it does not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications.&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns.&lt;br /&gt;
&lt;br /&gt;
3. Incurs a start-up penalty before the hardware prefetcher triggers, and issues extra fetches after the end of an array is reached.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Hardware-based techniques use the history of previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques fall into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is sequential readahead. The simplest form is One Block Lookahead (OBL), which prefetches one block beyond the requested block. OBL comes in three variants:&lt;br /&gt;
&lt;br /&gt;
(i) always prefetch - prefetch the next block on each reference &lt;br /&gt;
(ii) prefetch on miss - prefetch the next block only on a miss &lt;br /&gt;
(iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per-stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, several predicted blocks are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.&lt;br /&gt;
Most commercial data storage systems use very simple prefetching schemes such as sequential prefetching, because only sequential prefetching can achieve high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads, so stride detection is not worth implementing there. History-based prefetching suffers from low predictive accuracy and from the cost of the extra reads it adds to an already bottlenecked I/O system. A data storage system cannot use most hardware-initiated or software-initiated prefetching techniques because the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of Prefetching==&lt;br /&gt;
&lt;br /&gt;
1. Increased complexity and overhead of handling the prefetching algorithms. Performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.&lt;br /&gt;
&lt;br /&gt;
2. With multiple cores, prefetch requests can originate from many different cores. This puts additional stress on memory, which must not only deal with regular prefetch requests but also handle prefetches from different sources, greatly increasing the overhead and complexity of the logic.&lt;br /&gt;
&lt;br /&gt;
3. If prefetched data is stored in the data cache, then cache conflicts or cache pollution can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74549</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74549"/>
		<updated>2013-04-02T01:37:06Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Software and Hardware Prefetchinghttp://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques, such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt;, the programmer or compiler inserts prefetch instructions into the program. These instructions initiate a load of a cache line into the cache but do not stall waiting for the data to arrive. A critical property of a prefetch instruction is the time between executing the prefetch and using the data. If the prefetch is too close to the instruction that uses the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level, and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction that uses the data, the prefetched cache line may already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction but also introduces additional cost, since the cache line is now fetched twice from main memory or the next cache level, which increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement [http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html hardware prefetching]. The processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams in which the program accesses a sequence of consecutive cache lines. When such a stream is found, the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with a regular stride, which need not be to consecutive cache lines. When such an instruction is detected, the processor tries to prefetch the cache lines that the instruction will access next.&lt;br /&gt;
&lt;br /&gt;
An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches the cache lines adjacent to the ones being accessed by the program. This can be used to mimic the behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Must be added to the code explicitly, so it does not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications.&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns.&lt;br /&gt;
&lt;br /&gt;
3. Incurs a start-up penalty before the hardware prefetcher triggers, and issues extra fetches after the end of an array is reached.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Hardware-based techniques use the history of previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques fall into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is sequential readahead. The simplest form is One Block Lookahead (OBL), which prefetches one block beyond the requested block. OBL comes in three variants:&lt;br /&gt;
&lt;br /&gt;
(i) always prefetch - prefetch the next block on each reference &lt;br /&gt;
(ii) prefetch on miss - prefetch the next block only on a miss &lt;br /&gt;
(iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching &amp;lt;i&amp;gt;p&amp;lt;/i&amp;gt; blocks instead of one, where &amp;lt;i&amp;gt;p&amp;lt;/i&amp;gt; is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per-stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
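&lt;br /&gt;
A reference prediction table indexed by the program counter can be sketched as follows. This is a simplified illustration under stated assumptions: the class name StridePredictor is hypothetical, the table is unbounded, and a prediction is made as soon as the same non-zero stride is observed twice in a row for one instruction. Real designs typically add a bounded table and a confidence state machine.&lt;br /&gt;

```python
class StridePredictor:
    """Toy reference prediction table keyed by program counter (PC)."""

    def __init__(self):
        # Maps pc to (last_address, last_stride) for that instruction.
        self.table = {}

    def access(self, pc, address):
        """Record one access; return an address to prefetch, or None."""
        prediction = None
        if pc in self.table:
            last_address, last_stride = self.table[pc]
            stride = address - last_address
            # Predict only when the same non-zero stride repeats.
            if stride != 0 and stride == last_stride:
                prediction = address + stride
            self.table[pc] = (address, stride)
        else:
            # First time this instruction is seen: no stride known yet.
            self.table[pc] = (address, 0)
        return prediction


if __name__ == "__main__":
    predictor = StridePredictor()
    # One load instruction (hypothetical PC 0x400) walking memory in 64-byte steps.
    for addr in (100, 164, 228, 292):
        print(addr, predictor.access(0x400, addr))
```

The first two accesses only train the table entry. From the third access onward the predictor returns the next address in the stride sequence, which is the address a stride prefetcher would fetch ahead of the program.&lt;br /&gt;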
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.&lt;br /&gt;
In practice, most commercial data storage systems use very simple prefetching schemes such as sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads, so detecting them is not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. Data storage systems cannot use most hardware-initiated or software-initiated prefetching techniques because the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74548</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74548"/>
		<updated>2013-04-02T01:35:34Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Software and Hardware Prefetchinghttp://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by prefetching techniques, such as increasing the cache line size or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line may already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement &amp;lt;b&amp;gt;hardware prefetching&amp;lt;/b&amp;gt;. The processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
An [http://www.techarp.com/showfreebog.aspx?lang=0&amp;amp;bogno=282 adjacent cache line prefetcher] automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
3. Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Hardware-based techniques use previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
(i) always prefetch - prefetch the next block on each reference &lt;br /&gt;
(ii) prefetch on miss - prefetch the next block only on a miss &lt;br /&gt;
(iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching &amp;lt;i&amp;gt;p&amp;lt;/i&amp;gt; blocks instead of one, where &amp;lt;i&amp;gt;p&amp;lt;/i&amp;gt; is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per-stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.&lt;br /&gt;
In practice, most commercial data storage systems use very simple prefetching schemes such as sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads, so detecting them is not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. Data storage systems cannot use most hardware-initiated or software-initiated prefetching techniques because the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74546</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74546"/>
		<updated>2013-04-02T01:34:28Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Software and Hardware Prefetchinghttp://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by prefetching techniques, such as increasing the cache line size or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line may already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement &amp;lt;b&amp;gt;hardware prefetching&amp;lt;/b&amp;gt;. The processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A [http://www.bergs.com/stefan/general/general_slides/sld011.htm stride] prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
An &amp;lt;i&amp;gt;adjacent cache line prefetcher&amp;lt;/i&amp;gt; automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
3. Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Hardware-based techniques use previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
(i) always prefetch - prefetch the next block on each reference &lt;br /&gt;
(ii) prefetch on miss - prefetch the next block only on a miss &lt;br /&gt;
(iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching &amp;lt;i&amp;gt;p&amp;lt;/i&amp;gt; blocks instead of one, where &amp;lt;i&amp;gt;p&amp;lt;/i&amp;gt; is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per-stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.&lt;br /&gt;
In practice, most commercial data storage systems use very simple prefetching schemes such as sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads, so detecting them is not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. Data storage systems cannot use most hardware-initiated or software-initiated prefetching techniques because the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74544</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74544"/>
		<updated>2013-04-02T01:32:09Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by prefetching techniques, such as increasing the cache line size or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line may already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement &amp;lt;b&amp;gt;hardware prefetching&amp;lt;/b&amp;gt;. The processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stride&amp;lt;/i&amp;gt; prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
An &amp;lt;i&amp;gt;adjacent cache line prefetcher&amp;lt;/i&amp;gt; automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
3. Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques&amp;lt;ref&amp;gt;http://suif.stanford.edu/papers/mowry92/subsection3_5_2.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://static.usenix.org/event/fast07/tech/full_papers/gill/gill_html/node4.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
Hardware-based techniques use previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
(i) always prefetch - prefetch the next block on each reference &lt;br /&gt;
(ii) prefetch on miss - prefetch the next block only on a miss &lt;br /&gt;
(iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching &amp;lt;i&amp;gt;p&amp;lt;/i&amp;gt; blocks instead of one, where &amp;lt;i&amp;gt;p&amp;lt;/i&amp;gt; is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per-stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.&lt;br /&gt;
In practice, most commercial data storage systems use very simple prefetching schemes such as sequential prefetching. This is because only sequential prefetching can achieve a high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads, so detecting them is not worth implementing. History-based prefetching suffers from low predictive accuracy and the associated cost of the extra reads on an already bottlenecked I/O system. Data storage systems cannot use most hardware-initiated or software-initiated prefetching techniques because the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74541</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74541"/>
		<updated>2013-04-02T01:30:37Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by prefetching techniques, such as increasing the cache line size or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt;, the programmer or compiler inserts prefetch instructions into the program. These instructions initiate a load of a cache line into the cache but do not stall waiting for the data to arrive. A critical property of a prefetch instruction is the time between executing the prefetch and using the data. If the prefetch is too close to the instruction that uses the data, the cache line will not have had time to arrive from main memory or the next cache level, and the instruction will stall, which reduces the effectiveness of the prefetch. If the prefetch is too far ahead, the prefetched cache line may already have been evicted before the data is used. The instruction using the data then triggers another fetch of the cache line and has to stall. This not only eliminates the benefit of the prefetch but also adds cost, because the cache line is now fetched twice from main memory or the next cache level, increasing the memory bandwidth requirement of the program.&lt;br /&gt;
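&lt;br /&gt;
The timing trade-off above is often reduced to a rule of thumb that is not stated in the cited source: issue the prefetch roughly as many loop iterations ahead as the miss latency divided by the time of one iteration, rounded up. A minimal sketch, in which the function name and the cycle counts are hypothetical values chosen for illustration:&lt;br /&gt;

```python
import math


def prefetch_distance(miss_latency_cycles, cycles_per_iteration):
    """Estimate how many loop iterations ahead a prefetch should be issued.

    Both arguments are assumed, measured-elsewhere cycle counts; the
    result is rounded up so the data arrives no later than its use.
    """
    return math.ceil(miss_latency_cycles / cycles_per_iteration)


# Hypothetical numbers: a 200-cycle miss and a 30-cycle loop body.
distance = prefetch_distance(200, 30)
```

With an assumed 200-cycle miss latency and 30 cycles per iteration, the sketch suggests prefetching 7 iterations ahead. Real values depend on the processor and the loop, so the distance is usually tuned by measurement.&lt;br /&gt;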
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement &amp;lt;b&amp;gt;hardware prefetching&amp;lt;/b&amp;gt;. The processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams in which the program accesses a sequence of consecutive cache lines. When such a stream is found, the processor starts prefetching cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stride&amp;lt;/i&amp;gt; prefetcher looks for instructions that make accesses with a regular stride, which need not be to consecutive cache lines. When such an instruction is detected, the processor tries to prefetch the cache lines that the instruction will access next.&lt;br /&gt;
&lt;br /&gt;
An &amp;lt;i&amp;gt;adjacent cache line prefetcher&amp;lt;/i&amp;gt; automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
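&lt;br /&gt;
The stride variant can be sketched in a few lines. This is an illustrative model rather than any processor's actual design: it assumes a table keyed by the program counter of the load, and it assumes a prefetch is issued only once the same nonzero stride has been seen twice in a row. The function name and the sample addresses are hypothetical.&lt;br /&gt;

```python
def stride_prefetches(accesses):
    """Return the addresses a simple stride prefetcher would request.

    accesses is a list of (pc, address) pairs. The table maps each pc
    to its last address and last stride; a prefetch is issued only when
    the same nonzero stride is observed twice in a row for that pc.
    """
    table = {}  # pc -> (last_address, last_stride)
    prefetches = []
    for pc, address in accesses:
        if pc in table:
            last_address, last_stride = table[pc]
            stride = address - last_address
            if stride != 0 and stride == last_stride:
                # Stride confirmed: predict the next address for this pc.
                prefetches.append(address + stride)
            table[pc] = (address, stride)
        else:
            # First access from this pc: no stride is known yet.
            table[pc] = (address, 0)
    return prefetches


# One load instruction (pc 64) walking memory in steps of 128 bytes.
predicted = stride_prefetches([(64, 100), (64, 228), (64, 356), (64, 484)])
```

For this trace the sketch issues prefetches for addresses 484 and 612: the first two accesses only establish the stride, and each later access confirms it.&lt;br /&gt;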
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications.&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns.&lt;br /&gt;
&lt;br /&gt;
3. Incurs a start-up penalty before the hardware prefetcher triggers, and issues extra fetches after the end of an array is reached.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques===&lt;br /&gt;
Hardware-based techniques use previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
(i) always prefetch - prefetch the next block on each reference &lt;br /&gt;
(ii) prefetch on miss - prefetch the next block only on a miss &lt;br /&gt;
(iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
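&lt;br /&gt;
The three OBL policies can be compared with a small simulation. This is a sketch under stated assumptions, not a model of a real cache: it assumes a tiny fully associative LRU cache addressed by block number, and it treats a block as accessed for the first time on a demand miss or on the first reference after it was prefetched. The function name, policy labels, and capacity are hypothetical.&lt;br /&gt;

```python
from collections import OrderedDict


def simulate_obl(trace, policy, capacity=4):
    """Count demand misses for a block trace under one OBL policy.

    policy is "always", "on_miss" or "tagged". The cache is a small
    fully associative LRU cache (an assumption made for illustration).
    """
    cache = OrderedDict()  # block -> True while prefetched but not yet referenced
    misses = 0

    def install(block, tag):
        if block in cache:
            return
        if len(cache) == capacity:
            cache.popitem(last=False)  # evict the least recently used block
        cache[block] = tag

    for block in trace:
        hit = block in cache
        # First use: a demand miss, or the first touch of a prefetched block.
        first_use = (not hit) or cache[block]
        if hit:
            cache[block] = False
            cache.move_to_end(block)
        else:
            misses += 1
            install(block, False)
        if (policy == "always"
                or (policy == "on_miss" and not hit)
                or (policy == "tagged" and first_use)):
            install(block + 1, True)
    return misses


sequential = list(range(8))
results = {p: simulate_obl(sequential, p) for p in ("always", "on_miss", "tagged")}
```

On this purely sequential trace of eight blocks, the sketch reports 4 misses for prefetch on miss, because every other block still misses, and 1 miss for both always prefetch and tagged prefetch.&lt;br /&gt;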
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.&lt;br /&gt;
In practice, most commercial data storage systems use very simple prefetching schemes such as sequential prefetching, because only sequential prefetching can achieve high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and from the cost of extra reads on an already bottlenecked I/O system. Data storage systems cannot use most hardware-initiated or software-initiated prefetching techniques because the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74539</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74539"/>
		<updated>2013-04-02T01:29:54Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache is not large enough to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidations issued to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by prefetching techniques such as increasing the cache line size or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt;, the programmer or compiler inserts prefetch instructions into the program. These instructions initiate a load of a cache line into the cache but do not stall waiting for the data to arrive. A critical property of a prefetch instruction is the time between executing the prefetch and using the data. If the prefetch is too close to the instruction that uses the data, the cache line will not have had time to arrive from main memory or the next cache level, and the instruction will stall, which reduces the effectiveness of the prefetch. If the prefetch is too far ahead, the prefetched cache line may already have been evicted before the data is used. The instruction using the data then triggers another fetch of the cache line and has to stall. This not only eliminates the benefit of the prefetch but also adds cost, because the cache line is now fetched twice from main memory or the next cache level, increasing the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement &amp;lt;b&amp;gt;hardware prefetching&amp;lt;/b&amp;gt;. The processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams in which the program accesses a sequence of consecutive cache lines. When such a stream is found, the processor starts prefetching cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stride&amp;lt;/i&amp;gt; prefetcher looks for instructions that make accesses with a regular stride, which need not be to consecutive cache lines. When such an instruction is detected, the processor tries to prefetch the cache lines that the instruction will access next.&lt;br /&gt;
&lt;br /&gt;
An &amp;lt;i&amp;gt;adjacent cache line prefetcher&amp;lt;/i&amp;gt; automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching&amp;lt;ref&amp;gt;http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture&amp;lt;/ref&amp;gt;====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications.&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns.&lt;br /&gt;
&lt;br /&gt;
3. Incurs a start-up penalty before the hardware prefetcher triggers, and issues extra fetches after the end of an array is reached.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques===&lt;br /&gt;
Hardware-based techniques use previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
(i) always prefetch - prefetch the next block on each reference &lt;br /&gt;
(ii) prefetch on miss - prefetch the next block only on a miss &lt;br /&gt;
(iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.&lt;br /&gt;
In practice, most commercial data storage systems use very simple prefetching schemes such as sequential prefetching, because only sequential prefetching can achieve high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and from the cost of extra reads on an already bottlenecked I/O system. Data storage systems cannot use most hardware-initiated or software-initiated prefetching techniques because the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74537</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74537"/>
		<updated>2013-04-02T01:29:20Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache is not large enough to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidations issued to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by prefetching techniques such as increasing the cache line size or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching&amp;lt;ref&amp;gt;http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch_intro_prefetch.html&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt;, the programmer or compiler inserts prefetch instructions into the program. These instructions initiate a load of a cache line into the cache but do not stall waiting for the data to arrive. A critical property of a prefetch instruction is the time between executing the prefetch and using the data. If the prefetch is too close to the instruction that uses the data, the cache line will not have had time to arrive from main memory or the next cache level, and the instruction will stall, which reduces the effectiveness of the prefetch. If the prefetch is too far ahead, the prefetched cache line may already have been evicted before the data is used. The instruction using the data then triggers another fetch of the cache line and has to stall. This not only eliminates the benefit of the prefetch but also adds cost, because the cache line is now fetched twice from main memory or the next cache level, increasing the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement &amp;lt;b&amp;gt;hardware prefetching&amp;lt;/b&amp;gt;. The processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams in which the program accesses a sequence of consecutive cache lines. When such a stream is found, the processor starts prefetching cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stride&amp;lt;/i&amp;gt; prefetcher looks for instructions that make accesses with a regular stride, which need not be to consecutive cache lines. When such an instruction is detected, the processor tries to prefetch the cache lines that the instruction will access next.&lt;br /&gt;
&lt;br /&gt;
An &amp;lt;i&amp;gt;adjacent cache line prefetcher&amp;lt;/i&amp;gt; automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications.&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns.&lt;br /&gt;
&lt;br /&gt;
3. Incurs a start-up penalty before the hardware prefetcher triggers, and issues extra fetches after the end of an array is reached.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques===&lt;br /&gt;
Hardware-based techniques use previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
(i) always prefetch - prefetch the next block on each reference &lt;br /&gt;
(ii) prefetch on miss - prefetch the next block only on a miss &lt;br /&gt;
(iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.&lt;br /&gt;
In practice, most commercial data storage systems use very simple prefetching schemes such as sequential prefetching, because only sequential prefetching can achieve high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and from the cost of extra reads on an already bottlenecked I/O system. Data storage systems cannot use most hardware-initiated or software-initiated prefetching techniques because the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74535</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74535"/>
		<updated>2013-04-02T01:28:01Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache is not large enough to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidations issued to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by prefetching techniques such as increasing the cache line size or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt;, the programmer or compiler inserts prefetch instructions into the program. These instructions initiate a load of a cache line into the cache but do not stall waiting for the data to arrive. A critical property of a prefetch instruction is the time between executing the prefetch and using the data. If the prefetch is too close to the instruction that uses the data, the cache line will not have had time to arrive from main memory or the next cache level, and the instruction will stall, which reduces the effectiveness of the prefetch. If the prefetch is too far ahead, the prefetched cache line may already have been evicted before the data is used. The instruction using the data then triggers another fetch of the cache line and has to stall. This not only eliminates the benefit of the prefetch but also adds cost, because the cache line is now fetched twice from main memory or the next cache level, increasing the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement &amp;lt;b&amp;gt;hardware prefetching&amp;lt;/b&amp;gt;. The processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams in which the program accesses a sequence of consecutive cache lines. When such a stream is found, the processor starts prefetching cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stride&amp;lt;/i&amp;gt; prefetcher looks for instructions that make accesses with a regular stride, which need not be to consecutive cache lines. When such an instruction is detected, the processor tries to prefetch the cache lines that the instruction will access next.&lt;br /&gt;
&lt;br /&gt;
An &amp;lt;i&amp;gt;adjacent cache line prefetcher&amp;lt;/i&amp;gt; automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications.&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns.&lt;br /&gt;
&lt;br /&gt;
3. Incurs a start-up penalty before the hardware prefetcher triggers, and issues extra fetches after the end of an array is reached.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques===&lt;br /&gt;
Hardware-based techniques use previous memory accesses to dynamically predict which memory addresses to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip schemes: based on the addresses required by the processor in all data references.&lt;br /&gt;
* Off-chip schemes: based on the addresses that result in L1 cache misses.&lt;br /&gt;
&lt;br /&gt;
Some of the most commonly used Prefetching techniques are discussed below.&lt;br /&gt;
The most common prefetching approach is to perform sequential readahead. The simplest form is One Block Lookahead (OBL), where we prefetch one block beyond the requested block. OBL can be of three types: &lt;br /&gt;
&lt;br /&gt;
(i) always prefetch - prefetch the next block on each reference &lt;br /&gt;
(ii) prefetch on miss - prefetch the next block only on a miss &lt;br /&gt;
(iii) tagged prefetch - prefetch the next block only if the referenced block is accessed for the first time. &lt;br /&gt;
&lt;br /&gt;
P-Block Lookahead extends the idea of OBL by prefetching P blocks instead of one, where P is also referred to as the degree of prefetch. A variation of the P-Block Lookahead algorithm dynamically adapts the degree of prefetch to the workload.&lt;br /&gt;
&lt;br /&gt;
A per stream scheme selects the appropriate degree of prefetch on each miss based on a prefetch degree selector (PDS) table. For the case where cache is abundant, Infinite-Block Lookahead has also been studied.&lt;br /&gt;
Stride-based prefetching has also been studied, mainly for processor caches, where strides are detected based on information provided by the application, a lookahead into the instruction stream, or a reference prediction table indexed by the program counter. Sequential prefetching can be considered a better choice because most strides lie within the block size and it can also exploit locality.&lt;br /&gt;
History-based prefetching has been proposed in various forms. A history-based table can be used to predict the next pages to prefetch. In a variant of history-based prefetching, multiple memory predictions are prefetched at the same time. Data compression techniques have also been applied to predict future access patterns.&lt;br /&gt;
In practice, most commercial data storage systems use very simple prefetching schemes such as sequential prefetching, because only sequential prefetching can achieve high long-term predictive accuracy in data servers. Strides that cross page or track boundaries are uncommon in workloads and therefore not worth implementing. History-based prefetching suffers from low predictive accuracy and from the cost of extra reads on an already bottlenecked I/O system. Data storage systems cannot use most hardware-initiated or software-initiated prefetching techniques because the applications typically run on external hardware.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74532</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74532"/>
		<updated>2013-04-02T01:14:29Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative and used LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt;, the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
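&lt;br /&gt;
The timing trade-off can be illustrated with a toy model. This is only a sketch under stated assumptions: the function run_loop and its parameters are hypothetical, each loop iteration takes one time step and uses one cache line, a prefetched line arrives a fixed number of steps (latency) after it is issued, and prefetched lines sit in a small FIFO buffer of capacity lines.&lt;br /&gt;

```python
from collections import OrderedDict


def run_loop(n, distance, latency, capacity):
    """Classify each use of line i when line i + distance is prefetched at step i.

    Toy model (all parameters are illustrative assumptions):
      hit     - the line arrived in time and is still resident
      late    - the prefetch was issued but the line has not arrived yet
      refetch - the line arrived but was evicted before its use
    """
    cache = OrderedDict()   # resident lines in arrival (FIFO) order
    in_flight = {}          # line number mapped to its arrival step
    outcome = {"hit": 0, "late": 0, "refetch": 0}
    for step in range(n):
        # Lines whose latency has elapsed arrive; the oldest line is evicted if full.
        for line in [ln for ln, t in in_flight.items() if step >= t]:
            del in_flight[line]
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
        in_flight[step + distance] = step + latency   # issue the prefetch
        if step >= distance:   # skip warm-up lines that were never prefetched
            if step in cache:
                outcome["hit"] += 1
            elif step in in_flight:
                outcome["late"] += 1
            else:
                outcome["refetch"] += 1
    return outcome


too_close = run_loop(100, distance=2, latency=10, capacity=8)
well_timed = run_loop(100, distance=12, latency=10, capacity=8)
too_far = run_loop(100, distance=40, latency=10, capacity=8)
print(too_close, well_timed, too_far)
```

In this model a prefetch distance shorter than the latency makes every use late, a distance slightly larger than the latency makes every use a hit, and a distance that exceeds the latency by at least the buffer capacity causes every line to be evicted and fetched a second time.&lt;br /&gt;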
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement &amp;lt;b&amp;gt;hardware prefetching&amp;lt;/b&amp;gt;. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stride&amp;lt;/i&amp;gt; prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
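&lt;br /&gt;
A stride detector of this kind can be sketched as a small table indexed by the address of the load instruction. The code below is an illustrative assumption rather than the design of any particular processor: the function name stride_prefetches, the two-in-a-row confirmation rule, and the example addresses are all hypothetical.&lt;br /&gt;

```python
def stride_prefetches(accesses):
    """Return the addresses a simple per-instruction stride detector would prefetch.

    accesses is a list of (pc, address) pairs. In this sketch a prefetch is
    issued only after the same non-zero stride is seen twice in a row for a pc.
    """
    table = {}        # pc mapped to (last address, last stride)
    prefetched = []
    for pc, addr in accesses:
        if pc in table:
            last_addr, last_stride = table[pc]
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                prefetched.append(addr + stride)   # predict the next access
            table[pc] = (addr, stride)
        else:
            table[pc] = (addr, 0)   # first sighting: no stride known yet
    return prefetched


# One load instruction (hypothetical pc 0x40) walking an array 256 bytes at a time.
walk = [(0x40, 1000 + 256 * i) for i in range(5)]
issued = stride_prefetches(walk)
print(issued)
```

The first two accesses only train the table, which corresponds to the start-up penalty of hardware prefetching; from the third access onward the detector runs one stride ahead of the program.&lt;br /&gt;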
&lt;br /&gt;
An &amp;lt;i&amp;gt;adjacent cache line prefetcher&amp;lt;/i&amp;gt; automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The characteristics of the hardware prefetching are as follows :&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
3. Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques===&lt;br /&gt;
Based on the previous memory access, the hardware based techniques can be employed to dynamically predict the memory address to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip Schemes: Based on the addresses required by the processor in all data references.  &lt;br /&gt;
* Off-chip Schemes: Based on the addresses that result in L1 cache misses &lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74531</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74531"/>
		<updated>2013-04-02T01:14:00Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative and used LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt;, the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement &amp;lt;b&amp;gt;hardware prefetching&amp;lt;/b&amp;gt;. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stride&amp;lt;/i&amp;gt; prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
An &amp;lt;i&amp;gt;adjacent cache line prefetcher&amp;lt;/i&amp;gt; automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The characteristics of the hardware prefetching are as follows :&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
3. Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
===Hardware-based Prefetching Techniques===&lt;br /&gt;
Based on the previous memory access, the hardware based techniques can be employed to dynamically predict the memory address to prefetch. These techniques can be classified into two main classes:&lt;br /&gt;
* On-chip Schemes: Based on the addresses required by the processor in all data references. An example of this type of scheme is the &lt;br /&gt;
* Off-chip Schemes: Based on the addresses that result in L1 cache misses &lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74510</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74510"/>
		<updated>2013-04-02T00:49:21Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative and used LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt;, the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive. A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch. If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement &amp;lt;b&amp;gt;hardware prefetching&amp;lt;/b&amp;gt;. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stride&amp;lt;/i&amp;gt; prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
An &amp;lt;i&amp;gt;adjacent cache line prefetcher&amp;lt;/i&amp;gt; automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The characteristics of the hardware prefetching are as follows :&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
3. Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74509</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74509"/>
		<updated>2013-04-02T00:48:39Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative and used LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching===&lt;br /&gt;
&lt;br /&gt;
With &amp;lt;b&amp;gt;software prefetching&amp;lt;/b&amp;gt; the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive.&lt;br /&gt;
&lt;br /&gt;
A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch.&lt;br /&gt;
&lt;br /&gt;
If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many modern processors implement &amp;lt;b&amp;gt;hardware prefetching&amp;lt;/b&amp;gt;. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stride&amp;lt;/i&amp;gt; prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
An &amp;lt;i&amp;gt;adjacent cache line prefetcher&amp;lt;/i&amp;gt; automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The characteristics of the hardware prefetching are as follows :&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
3. Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74508</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74508"/>
		<updated>2013-04-02T00:47:27Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative and used LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. &lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
&lt;br /&gt;
* Software Prefetching&lt;br /&gt;
&lt;br /&gt;
* Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching===&lt;br /&gt;
&lt;br /&gt;
====Software Prefetching====&lt;br /&gt;
With software prefetching the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive.&lt;br /&gt;
&lt;br /&gt;
A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch.&lt;br /&gt;
&lt;br /&gt;
If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line will instead already have been evicted again before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction, but introduces additional costs since the cache line is now fetched twice from main memory or the next cache level. This increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
====Hardware Prefetching====&lt;br /&gt;
Many modern processors implement hardware prefetching. This means that the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines are accessed by the program. When such a stream is found the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stride&amp;lt;/i&amp;gt; prefetcher looks for instructions that make accesses with regular strides, that do not necessarily have to be to consecutive cache lines. When such an instruction is detected the processor tries to prefetch the cache lines it will access ahead of it.&lt;br /&gt;
&lt;br /&gt;
An &amp;lt;i&amp;gt;adjacent cache line prefetcher&amp;lt;/i&amp;gt; automatically fetches adjacent cache lines to ones being accessed by the program. This can be used to mimic behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
====Software vs. Hardware Prefetching====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The characteristics of the hardware prefetching are as follows :&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
3. Start-up penalty before hardware prefetcher triggers and extra fetches after array finishes.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4K page boundary (i.e., the program would have to initiate demand loads for the new page before the hardware prefetcher will start prefetching from the new page).&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/10a_jp&amp;diff=74507</id>
		<title>CSC/ECE 506 Spring 2012/10a jp</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/10a_jp&amp;diff=74507"/>
		<updated>2013-04-02T00:46:17Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* == */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
=='''Introduction'''==&lt;br /&gt;
&lt;br /&gt;
Buffering and pipelining are attractive techniques for hiding the latency of memory accesses in large scale shared-memory multiprocessors. However, the unconstrained use of these techniques can result in an intractable programming model for the machine. Consistency models provide more tractable programming models by introducing various restrictions on the amount of buffering and pipelining allowed.&lt;br /&gt;
Several memory consistency models have been proposed in the literature. The strictest model is sequential consistency, which requires the execution of a parallel program to appear as some interleaving of the execution of the parallel processes on a sequential machine. Sequential consistency imposes severe restrictions on buffering and pipelining of memory accesses. One of the least strict models is release consistency (RC), which allows significant overlap of memory accesses provided that synchronization accesses are identified and classified into acquires and releases. Other relaxed models that have been discussed in the literature are processor consistency (PC), weak consistency (WC), and data-race-free-0 (DRF0). These models fall between the sequential and release consistency models in terms of strictness.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching'''==&lt;br /&gt;
&lt;br /&gt;
The delay constraints imposed by a consistency model limit the amount of buffering and pipelining among memory accesses. Prefetching provides one method for increasing performance by partially proceeding with an access that is delayed due to consistency model constraints. &lt;br /&gt;
Prefetches can be initiated by an explicit fetch operation within a program, by logic that monitors the processor’s referencing pattern to infer prefetching, or by a combination of these approaches. However they are initiated, prefetches must be issued in a timely manner. If a prefetch is issued too early, there is a chance that the prefetched data will displace other useful data from the higher levels of the memory hierarchy or be displaced itself before use. If the prefetch is issued too late, it may not arrive before the actual memory reference and thereby introduce processor stall cycles. Software prefetching issues fetches only for data that is likely to be used, while hardware schemes tend to fetch data in a more speculative manner. &lt;br /&gt;
&lt;br /&gt;
The decision of where to place prefetched data in the memory hierarchy is a fundamental design decision. Clearly, data must be moved into a higher level of the memory hierarchy to provide a performance benefit. The majority of schemes place prefetched data in some type of cache memory. Other schemes place prefetched data in dedicated buffers to protect the data from premature cache evictions and prevent cache pollution. When prefetched data are placed into named locations, such as processor registers or memory, the prefetch is said to be binding and additional constraints must be imposed on the use of the data. Finally, multiprocessor systems can introduce additional levels into the memory hierarchy which must be taken into consideration.&lt;br /&gt;
&lt;br /&gt;
Data can be prefetched in units of single words, cache blocks, contiguous blocks of memory or program data objects. Often, the amount of data fetched is determined by the organization of the underlying cache and memory system. Cache blocks may be the most appropriate size for uniprocessors and Symmetric Multiprocessors (SMPs) while larger memory blocks may be used to amortize the cost of initiating a data transfer across an interconnection network of a large, distributed memory multiprocessor.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==='''Description'''&amp;lt;ref&amp;gt;Two Techniques to Enhance the Performance of Memory Consistency Models by Kourosh Gharachorloo, Anoop Gupta, and John Hennessy&lt;br /&gt;
&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
Prefetching can be classified based on whether it is binding or non-binding, and whether it is controlled by hardware or software. With a binding prefetch, the value of a later reference (e.g., a register load) is bound at the time the prefetch completes. This places restrictions on when a binding prefetch can be issued, since the value will become stale if another processor modifies the location during the interval between prefetch and reference. Hardware cache-coherent architectures, such as the Stanford DASH multiprocessor, can provide prefetching that is non-binding. With a non-binding prefetch, the data is brought close to the processor (e.g., into the cache) and is kept coherent until the processor actually reads the value. Thus, non-binding prefetching does not affect correctness for any of the consistency models and can be used simply as a performance-boosting technique. The technique described in this section assumes hardware-controlled non-binding prefetch.&lt;br /&gt;
&lt;br /&gt;
Prefetching can enhance performance by partially servicing large latency accesses that are delayed due to consistency model constraints. For a read operation, a read prefetch can be used to bring the data into the cache in a read-shared state while the operation is delayed due to consistency constraints. Since the prefetch is non-binding, we are guaranteed that the read operation will return a correct value once it is allowed to perform, regardless of when the prefetch completed. In the majority of cases, we expect the result returned by the prefetch to be the correct result. The only time the result may be different is if the location is written to between the time the prefetch returns the value and the time the read is allowed to perform. In this case, the prefetched location would either get invalidated or updated, depending on the coherence scheme. If invalidated, the read operation will miss in the cache and access the new value from the memory system, as if the prefetch never occurred. In the case of an update protocol, the location is kept up-to-date, thus providing the new value to the read operation.&lt;br /&gt;
&lt;br /&gt;
For a write operation, a read-exclusive prefetch can be used to acquire exclusive ownership of the line, enabling the write to that location to complete quickly once it is allowed to perform. A read-exclusive prefetch is only possible if the coherence scheme is invalidation-based. Similar to the read prefetch case, the line is invalidated if another processor writes to the location between the time the read-exclusive prefetch completes and the actual write operation is allowed to proceed. In addition, exclusive ownership is surrendered if another processor reads the location during that time.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==='''Software Implementation'''&amp;lt;ref&amp;gt;Fu, J.W.C. and J.H. Patel, “Data Prefetching in Multiprocessor Vector Cache Memories,”&lt;br /&gt;
Proc. 18th International Symposium on Computer Architecture, Toronto, Ont., Canada, May&lt;br /&gt;
1991, p. 54-63.&amp;lt;/ref&amp;gt;===&lt;br /&gt;
[[Image:P4.PNG|thumbnail|right|x500px|Figure 1:Inner product calculation using a) no prefetching, b) simple prefetching, c)prefetching with loop unrolling and d) software pipelining.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
Most contemporary microprocessors support some form of fetch instruction which can be used to implement prefetching. The implementation of a fetch can be as simple as a load into a processor register that has been hardwired to zero. Slightly more sophisticated implementations provide hints to the memory system as to how the prefetched block will be used. Such information may be useful in multiprocessors where data can be prefetched in different sharing states, for example. Although particular implementations will vary, all fetch instructions share some common characteristics. Fetches are non-blocking memory operations and therefore require a lockup-free cache that allows prefetches to bypass other outstanding memory operations in the cache.&lt;br /&gt;
&lt;br /&gt;
Prefetches are typically implemented in such a way that fetch instructions cannot cause exceptions. Exceptions are suppressed for prefetches to ensure that they remain an optional optimization feature that does not affect program correctness or initiate large and potentially unnecessary overhead, such as page faults or other memory exceptions. The task of choosing where in the program to place a fetch instruction relative to the matching load or store instruction is known as prefetch scheduling.&lt;br /&gt;
&lt;br /&gt;
Fetch instructions may be added by the programmer or by the compiler during an optimization pass. Unlike many optimizations which occur too frequently in a program or are too tedious to implement by hand, prefetch scheduling can often be done effectively by the programmer. Whether hand-coded or automated by a compiler, prefetching is most often used within loops responsible for large array calculations. Such loops provide excellent prefetching opportunities because they are common in scientific codes, exhibit poor cache utilization and often have predictable array referencing patterns. By establishing these patterns at compile-time, fetch instructions can be placed inside loop bodies so that data for a future loop iteration can be prefetched during the current iteration.&lt;br /&gt;
&lt;br /&gt;
As an example of how loop-based prefetching may be used, consider the code segment shown in Figure 1a. This loop calculates the inner product of two vectors, a and b, in a manner similar to the innermost loop of a matrix multiplication calculation. If we assume a four-word cache block, this code segment will cause a cache miss every fourth iteration. We can attempt to avoid these cache misses by adding the prefetch directives shown in Figure 1b. Note that this figure is a source code representation of the assembly code that would be generated by the compiler. This simple approach issues a fetch on every iteration even though each fetch brings a whole four-word block into the cache, so most of the prefetches are unnecessary. Unrolling the loop by a factor of four, so that one fetch is issued per cache block, avoids this. The code segment given in Figure 1c removes most cache misses and unnecessary prefetches, but further improvements are possible. Note that cache misses will occur during the first iteration of the loop since prefetches are never issued for the initial iteration. Unnecessary prefetches will occur in the last iteration of the unrolled loop, where the fetch commands attempt to access data past the loop index boundary. Both of the above problems can be remedied by using software pipelining techniques as shown in Figure 1d.&amp;lt;ref&amp;gt;Luk, C-K. and T.C. Mowry, “Compiler-based Prefetching for Recursive Data Structures,”&lt;br /&gt;
Proc. 7th Conf. on Architectural Support for Programming Languages and Operating&lt;br /&gt;
Systems, Cambridge, MA, October 1996, p. 222-233.&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
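&lt;br /&gt;
The effect of the four loop variants in Figure 1 can be approximated with a small simulation. The following Python sketch is a toy model and not the code from the figure: it tracks a single array, assumes four-word blocks, an unbounded cache, and prefetches that complete instantly, and the names run, fetch, and load are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Toy model (an assumption, not the cited authors' simulator): a cache with
# four-word blocks and unlimited capacity, where a prefetch completes instantly.
BLOCK_WORDS = 4


def run(n, variant):
    """Count (demand misses, prefetches) for one array of n words.

    n must be a positive multiple of BLOCK_WORDS.
    """
    cached = set()              # block numbers currently in the cache
    misses = 0
    prefetches = 0

    def fetch(i):
        nonlocal prefetches
        prefetches += 1
        cached.add(i // BLOCK_WORDS)

    def load(i):
        nonlocal misses
        block = i // BLOCK_WORDS
        if block not in cached:
            misses += 1
            cached.add(block)

    if variant == "none":                 # Figure 1a: no prefetching
        for i in range(n):
            load(i)
    elif variant == "simple":             # Figure 1b: fetch on every iteration
        for i in range(n):
            fetch(i + BLOCK_WORDS)
            load(i)
    elif variant == "unrolled":           # Figure 1c: one fetch per block
        for i in range(0, n, BLOCK_WORDS):
            fetch(i + BLOCK_WORDS)
            for j in range(i, i + BLOCK_WORDS):
                load(j)
    elif variant == "pipelined":          # Figure 1d: prolog, main loop, epilog
        fetch(0)                          # prolog covers the first block
        for i in range(0, n - BLOCK_WORDS, BLOCK_WORDS):
            fetch(i + BLOCK_WORDS)
            for j in range(i, i + BLOCK_WORDS):
                load(j)
        for j in range(n - BLOCK_WORDS, n):   # epilog issues no prefetch
            load(j)
    else:
        raise ValueError(variant)
    return misses, prefetches


for name in ("none", "simple", "unrolled", "pipelined"):
    print(name, run(16, name))
```
&lt;br /&gt;
Under these assumptions, the run with n = 16 reports 4 misses and no prefetches without prefetching, 1 miss and 16 prefetches for simple prefetching, 1 miss and 4 prefetches after unrolling, and 0 misses and 4 prefetches with software pipelining.&lt;br /&gt;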
&lt;br /&gt;
&lt;br /&gt;
==='''Hardware Implementation'''===&lt;br /&gt;
&lt;br /&gt;
===='''Simulated Processing Node Architecture'''&amp;lt;ref&amp;gt;Gornish, E.H., E.D. Granston and A.V. Veidenbaum, “Compiler-directed Data Prefetching in&lt;br /&gt;
Multiprocessors with Memory Hierarchies,” Proc. International Conference on&lt;br /&gt;
Supercomputing, Amsterdam, Netherlands, June 1990, p. 354-68.&amp;lt;/ref&amp;gt;====&lt;br /&gt;
[[Image:P1.PNG|thumbnail|right|x200px|Figure 2:Interfacing in Simulated Processing Node Architecture &amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In contrast to software-controlled prefetching, the processors do not need to support specific prefetch instructions. The environment to which these processors are interfaced is shown in Figure 2. According to the figure, the processing node consists of a processor, a first-level cache (FLC), a second-level cache (SLC), first- and second-level write buffers (FLWB and SLWB), a local bus, a network interface controller, and a memory module. The FLC is a direct-mapped write-through cache with no allocation of blocks on write misses and is blocking on read misses. The SLC is a direct-mapped write-back cache. For both prefetching techniques, we only prefetch into the SLC. In addition, the SLWB keeps track of the outstanding requests (SLC read misses, prefetches and write requests). The SLC controller deals with all the complexities of the cache coherence protocol and the prefetching mechanism. The SLC snoops on the local bus for consistency actions: if an invalidation to a block residing in the SLC is detected, the copy of the block is invalidated in both the SLC and the FLC. Requests from the local bus have priority over those from the FLWB. This interference with processor accesses is tolerable because most accesses hit in the FLC.&lt;br /&gt;
&amp;lt;ref&amp;gt;Sequential Hardware Prefetching in Shared-Memory Multiprocessors by Fredrik Dahlgren and Michel Dubois&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===='''Sequential Prefetching'''&amp;lt;ref&amp;gt;Data Prefetch Mechanisms by Steven P. VanderWiel and David J. Lilja&lt;br /&gt;
&amp;lt;/ref&amp;gt;====&lt;br /&gt;
[[Image:P2.PNG|thumbnail|right|x300px|Figure 3:Three forms of sequential prefetching: a) Prefetch on miss, b) tagged prefetch and &lt;br /&gt;
c) sequential prefetching with K = 2&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Most (but not all) prefetching schemes are designed to fetch data from main memory into the processor cache in units of cache blocks. It should be noted, however, that multiple word cache blocks are themselves a form of data prefetching. By grouping consecutive memory words into single units, caches exploit the principle of spatial locality to implicitly prefetch data that is likely to be referenced in the near future. The degree to which large cache blocks can be effective in prefetching data is limited by the ensuing cache pollution effects. That is, as the cache block size increases, so does the amount of potentially useful data displaced from the cache to make room for the new block. In shared memory multiprocessors with private caches, large cache blocks may also cause false sharing which occurs when two or more processors wish to access different words within the same cache block and at least one of the accesses is a store. Although the accesses are logically applied to separate words, the cache hardware is unable to make this  distinction since it operates only on whole cache blocks. The accesses are therefore treated as operations applied to a single object and cache coherence traffic is generated to ensure that the changes made to a block by a store operation are seen by all processors caching the block. In the case of false sharing, this traffic is unnecessary since only the processor executing the store references the word being written. &lt;br /&gt;
&lt;br /&gt;
Increasing the cache block size increases the likelihood of two processors sharing data from the same block and hence false sharing is more likely to arise. Large blocks can also waste cache space. For example, a large block may contain one word which is frequently referenced and several other words which are not in use. Assuming a least recently used (LRU) replacement policy, the entire block will be retained even though only a portion of the block’s data is actually in use. If this large block were replaced with two smaller blocks, one of them could be evicted to make room for more active data. Sequential prefetching can take advantage of spatial locality without introducing some of the problems associated with large cache blocks. The simplest sequential prefetching schemes are variations upon the one block lookahead (OBL) approach, which initiates a prefetch for block b+1 when block b is accessed. The prefetch-on-miss algorithm initiates the prefetch for block b+1 only when the access to block b misses in the cache, whereas the tagged prefetch algorithm associates a tag bit with every block and also prefetches the next sequential block when a prefetched block is referenced for the first time. The reason prefetch-on-miss is less effective is illustrated in Figure 3, where the behavior of each algorithm when accessing three contiguous blocks is shown. Here, it can be seen that a strictly sequential access pattern will result in a cache miss for every other cache block when the prefetch-on-miss algorithm is used, but this same access pattern results in only one cache miss when employing a tagged prefetch algorithm.&lt;br /&gt;
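&lt;br /&gt;
The difference between the two one block lookahead variants can be checked with a short simulation. The Python sketch below is illustrative only: it assumes an unbounded cache and prefetches that arrive before the next access, and the function names are hypothetical. The set named tagged holds blocks that were prefetched but not yet referenced, which plays the role of the tag bit.&lt;br /&gt;
&lt;br /&gt;
```python
def prefetch_on_miss(blocks):
    """Prefetch block b+1 only when the access to block b misses."""
    cache = set()
    misses = 0
    for b in blocks:
        if b not in cache:
            misses += 1
            cache.update((b, b + 1))   # demand fetch plus one-block lookahead
    return misses


def tagged_prefetch(blocks):
    """Prefetch b+1 on a miss and on the first reference to a prefetched block."""
    cache = set()
    tagged = set()                     # prefetched blocks not yet referenced
    misses = 0
    for b in blocks:
        if b not in cache:
            misses += 1
            cache.add(b)
            trigger = True
        else:
            trigger = b in tagged      # first use of a prefetched block
            tagged.discard(b)
        if trigger and b + 1 not in cache:
            cache.add(b + 1)
            tagged.add(b + 1)
    return misses


stream = list(range(8))                # strictly sequential block accesses
print(prefetch_on_miss(stream), tagged_prefetch(stream))
```
&lt;br /&gt;
For eight sequential blocks, this model counts 4 misses for prefetch-on-miss (every other block) and 1 miss for tagged prefetch, matching the behavior described above.&lt;br /&gt;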
&lt;br /&gt;
==='''Prefetching on Multiprocessors'''===&lt;br /&gt;
&lt;br /&gt;
This subsection discusses the requirements that the prefetch technique imposes on a multiprocessor architecture. Let us first consider how the proposed prefetch technique can be incorporated into the processor environment. Assume the general case where the processor has a load and a store buffer. The usual way to enforce a consistency model is to delay the issue of accesses in the buffer until certain previous accesses complete. Prefetching can be incorporated in this framework by having the hardware automatically issue a prefetch (read prefetch for reads and read-exclusive prefetch for writes and atomic read-modify-writes) for accesses that are in the load or store buffer, but are delayed due to consistency constraints. A prefetch buffer may be used to buffer multiple prefetch requests. Prefetches can be retired from this buffer as fast as the cache and memory system allow.&lt;br /&gt;
&lt;br /&gt;
A prefetch request first checks the cache to see whether the line is already present. If so, the prefetch is discarded. Otherwise the prefetch is issued to the memory system. When the prefetch response returns to the processor, it is placed in the cache. If a processor references a location it has prefetched before the result has returned, the reference request is combined with the prefetch request so that a duplicate request is not sent out and the reference completes as soon as the prefetch result returns. The prefetch technique discussed imposes several requirements on the memory system. Most importantly, the architecture requires hardware-coherent caches. In addition, the location to be prefetched needs to be cacheable. Also, to be effective for writes, prefetching requires an invalidation-based coherence scheme. In update-based schemes, it is difficult to partially service a write operation without making the new value available to other processors, which results in the write being performed. The strengths and weaknesses of hardware-controlled non-binding prefetching are discussed in the next subsection.&lt;br /&gt;
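&lt;br /&gt;
The request flow described above can be sketched as a small state machine. This Python model is an illustration under stated assumptions, not the hardware design from the cited paper: the class and method names are hypothetical, coherence actions are left out, and treating a prefetch for an already outstanding line as discarded is an added assumption.&lt;br /&gt;
&lt;br /&gt;
```python
class PrefetchingCache:
    """Toy model of prefetch issue, discard, and combining (names hypothetical)."""

    def __init__(self):
        self.lines = set()         # addresses of lines present in the cache
        self.outstanding = {}      # line address -> references waiting on it
        self.memory_requests = 0   # requests actually sent to memory

    def prefetch(self, line):
        # Assumption: a line that is already requested is also not re-requested.
        if line in self.lines or line in self.outstanding:
            return "discarded"
        self.outstanding[line] = []
        self.memory_requests += 1
        return "issued"

    def reference(self, line, who):
        if line in self.lines:
            return "hit"
        if line in self.outstanding:
            self.outstanding[line].append(who)   # combine with pending request
            return "combined"
        self.outstanding[line] = [who]
        self.memory_requests += 1
        return "miss"

    def response(self, line):
        """Memory reply: fill the cache and release any combined references."""
        self.lines.add(line)
        return self.outstanding.pop(line)


cache = PrefetchingCache()
print(cache.prefetch(0x40))              # prefetch goes to memory
print(cache.reference(0x40, "load A"))   # reference arrives before the reply
print(cache.response(0x40))              # reply completes the waiting reference
print(cache.prefetch(0x40))              # line now present, prefetch dropped
print(cache.memory_requests)
```
&lt;br /&gt;
In this run only one request reaches memory: the early reference is combined with the outstanding prefetch, and the later prefetch is discarded because the line is already cached.&lt;br /&gt;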
&lt;br /&gt;
=='''Evolution of Prefetching'''==&lt;br /&gt;
&lt;br /&gt;
During the 1990s, one of the major developments in the computer industry was the very rapid increase in the speed of processors.&amp;lt;ref&amp;gt;http://www.multicoreinfo.com/prefetching-multicore-processors/&amp;lt;/ref&amp;gt;  Unfortunately, main memory access speed did not increase at nearly the same rate, thus giving a rather large speed advantage to the processor and its cache over RAM.  Having to suffer a cache miss and access memory during this period threatened to undermine all the gains that were being achieved with these new processors.  Thus, new techniques were required to keep data as close to the local processor as possible.  Prefetching helped significantly with this issue.  With a good algorithm in place to identify blocks of memory that are likely to be needed in the near future, even though they are not currently stored in the cache, the latency of bringing data in from main memory can be hidden to a significant degree and performance improved.  We will explore some of the architectures utilizing prefetching and also discuss the pitfalls and future of this cache performance technique.&lt;br /&gt;
&lt;br /&gt;
==='''Prefetching Results and Analysis'''===&lt;br /&gt;
&lt;br /&gt;
Numerous studies have been conducted to create benchmarks for various prefetching designs.  One such study was conducted on the Intel Pentium III processor, which showed quite a bit of progress.&lt;br /&gt;
&lt;br /&gt;
===='''Intel Pentium III'''====&lt;br /&gt;
&lt;br /&gt;
The Laboratory of Computer Architecture at the University of Texas&amp;lt;ref&amp;gt;http://lca.ece.utexas.edu/sponsors/dell_project1.html&amp;lt;/ref&amp;gt; ran an analysis of the Pentium III processor and arrived at quite positive conclusions.  Specifically, media instructions were isolated and analyzed in terms of their effectiveness.  It was found that the prefetch instructions added to the media portion of the Intel chip improved performance by 13-15%, even without compiler optimizations.  They found that the more memory bound an application becomes, as often happens with media-intensive applications, the more benefit can be derived from prefetching.  They also found, however, that prefetches needed to be inserted very carefully and that it was often better to include prefetching strategies only for memory-intensive applications.  Indiscriminate prefetching may actually do more harm than good according to their study.&lt;br /&gt;
&lt;br /&gt;
They also discovered that, in fact, many of the applications did not really benefit from prefetching.  This is because a number of applications already had relatively high cache hit ratios.  If the cache hit ratio is fairly high, the overhead of prefetching can even degrade performance slightly.  The best solution is to focus on situations with high cache miss rates.&lt;br /&gt;
&lt;br /&gt;
In summary, it seems that, while effective, prefetching needs to be used judiciously and that compiler optimizations are often more significant in terms of performance improvement.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching Disadvantages'''==&lt;br /&gt;
&lt;br /&gt;
In time, however, with the turn of the century, a number of problems and challenges began to emerge with prefetching.&lt;br /&gt;
&lt;br /&gt;
One common issue with prefetching has always been the increased complexity and overhead of handling the prefetching algorithms.  There is great risk that this overhead can outweigh any benefits if the prefetching algorithm is not accurate, either fetching too early or fetching too late to be effective (as discussed in prior sections).  Performance must be improved significantly to counterbalance this overhead and complexity or the efforts are not worthwhile.&lt;br /&gt;
&lt;br /&gt;
Additional issues came about with the increasing popularity of multicore architectures.  In a single-core architecture, prefetching requests originate from just one core.  With multiple cores, prefetching requests can originate from several different cores.  This puts additional stress on the memory system, which must deal not only with regular prefetch requests but also with prefetches from different sources, thus greatly increasing the overhead and complexity of the logic.  Additionally, coherence algorithms must account not only for typical sequential consistency issues, but also for the possibility of data residing in yet another location, namely wherever the prefetched data is held.  Flushing data becomes significantly more complicated and cumbersome.&lt;br /&gt;
&lt;br /&gt;
Lastly, and most importantly, there is an ongoing conflict: if prefetched data is stored in the data cache, then cache conflicts (or cache pollution; see [[#Definitions]]) can become a significant burden.  This is because the current and predicted sets of data must exist in the cache at the same time.  Without prefetching, this space could simply be used to increase the effective cache size.  One solution is to add a dedicated buffer for prefetched data so that it does not occupy cache space.  This, however, requires extra hardware and uses precious chip area at a time when cost pressures in the PC world were becoming more acute.&lt;br /&gt;
&lt;br /&gt;
=='''Prefetching Improvements'''==&lt;br /&gt;
&lt;br /&gt;
Fortunately a number of techniques have evolved to address many of the concerns about prefetching, thus making it significantly more effective.&lt;br /&gt;
&lt;br /&gt;
==='''Cache Sizing'''===&lt;br /&gt;
&lt;br /&gt;
Increasing the dimensions of the cache greatly improves the performance of prefetching.  This can be done either by increasing the cache size itself or by increasing set associativity.  A small cache is a significant problem for prefetching because conflict misses increase dramatically.  Any benefit derived from having fewer cache misses overall is negated by this higher incidence of conflict misses.  One study assembled benchmarks for a SPARC (see [[#Definitions]]) processor that utilized prefetching.  The following graph shows cache misses for various cache sizes.  With 16K and 32K cache sizes, prefetching reduces cache misses to fairly close to 0, while the number of misses without prefetching is rather high.&lt;br /&gt;
&lt;br /&gt;
For example, the following study of a bioinformatics application showed the effect of prefetching on cache misses as a function of the size of the L1 cache.  As can be seen, cache size has a major impact on the miss rate and, consequently, on system performance.&amp;lt;ref&amp;gt;http://www.nikolaylaptev.com/master/classes/cs254.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Cachesize.jpg]]&lt;br /&gt;
&lt;br /&gt;
==='''Improved Prefetch Timing'''===&lt;br /&gt;
&lt;br /&gt;
Improving prefetch timing algorithms can also significantly improve performance and prevent data from being prefetched too late or too early to be useful.  One possibility in this case is to implement a technique called Aggressive Prefetching&amp;lt;ref&amp;gt;http://static.usenix.org/events/hotos05/prelim_papers/papathanasiou/papathanasiou_html/paper.html&amp;lt;/ref&amp;gt;.  Traditionally, prefetch algorithms have fetched only incremental amounts of data for which there is a high level of confidence that they will be used.  The more aggressive approach, however, searches more deeply in the reference stream to bring in a wider array of data that may be used in the near future.  This does, however, require a more resource-rich machine to avoid negatively impacting performance.&lt;br /&gt;
&lt;br /&gt;
Greater amounts of processing power and storage have made this more aggressive approach possible.  While older architectures with slower processors and less memory might have had issues with larger volumes of data being prefetched, this is much less of a concern now.  Varying performance from a variety of devices also changes the need to be conservative in prefetching.  Aggressive prefetching can take advantage of some latencies being higher than others, whereas older, more conservative architectures had more consistent levels of bandwidth throughout.&lt;br /&gt;
&lt;br /&gt;
==='''Memory Pattern Recognition'''===&lt;br /&gt;
&lt;br /&gt;
Effective memory pattern recognition algorithms can go a long way towards avoiding unnecessary prefetches (i.e., of values that will not be used soon enough to matter).  One example is the stride technique, which looks for accesses separated by a regular distance.  For example, memory addresses that are sequential in nature, such as the elements of an array, are brought in together as a unit, since most or all of these elements are likely to be used.  This is an improvement over a more conservative prefetcher that simply assumes memory located together will be used.  The more conservative version has been known to have a much higher incidence of bringing in unnecessary blocks of data.&lt;br /&gt;
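&lt;br /&gt;
One common way to recognize a stride is to remember, per load instruction, the last address and the last observed stride. The Python sketch below illustrates that idea; it is a simplified assumption about how such a detector might work rather than a description of any particular processor, and the class name and the rule of waiting for two matching strides are choices made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
class StridePrefetcher:
    """Per-instruction stride detector (a simplified prediction table)."""

    def __init__(self):
        self.table = {}   # pc -> (last address, last stride)

    def access(self, pc, addr):
        """Record an access; return the address to prefetch, or None."""
        last = self.table.get(pc)
        if last is None:
            self.table[pc] = (addr, 0)
            return None
        last_addr, last_stride = last
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        # Prefetch only once the same non-zero stride has been seen twice.
        if stride != 0 and stride == last_stride:
            return addr + stride
        return None


p = StridePrefetcher()
print([p.access(pc=0x400, addr=a) for a in (100, 108, 116, 124)])
```
&lt;br /&gt;
With a constant stride of 8, the first two accesses only train the table; the third and fourth accesses predict addresses 124 and 132 respectively.&lt;br /&gt;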
&lt;br /&gt;
Another variation of this is the linked memory reference pattern.  Linked lists(See [[#Definitions]]) and tree structures are most often associated with this type.  Generally, occurrences of code similar to ptr = ptr-&amp;gt;next are considered as one structure and subsequently brought into memory together.&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, instead of:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   work(p);&lt;br /&gt;
   p = next(p);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
the linked memory reference pattern may work in the following way:&lt;br /&gt;
&lt;br /&gt;
 while (p) {&lt;br /&gt;
   prefetch(next(p));  /* start fetching the next node early */&lt;br /&gt;
   work(p);            /* overlaps with the prefetch */&lt;br /&gt;
   p = next(p);&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Here, the node that follows p is prefetched while work(p) executes, so it is likely to be close to the processor by the time the loop advances to it.  This is more efficient than the initial loop.&lt;br /&gt;
&lt;br /&gt;
==='''Markov Prefetcher'''===&lt;br /&gt;
&lt;br /&gt;
This particular prefetch technique&amp;lt;ref&amp;gt;http://www.bergs.com/stefan/general/papers/general.pdf&amp;lt;/ref&amp;gt; is built around the idea of remembering past sequences of cache misses and using this history to bring in the subsequent misses in the sequence.  Generally, a rather large table containing previous miss addresses is used.  This table is maintained in a manner similar to a cache.  Sequences stored in this table are then used to predict future sequences that may need to be prefetched.&lt;br /&gt;
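&lt;br /&gt;
The idea of correlating each miss address with the misses that followed it in the past can be sketched as follows. This Python model is a minimal illustration and not the design from the cited paper: the table here is unbounded rather than managed like a cache, and the class name and the width parameter (how many successors to predict) are assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter, defaultdict


class MarkovPrefetcher:
    """Miss-address correlation table; size and width are assumptions."""

    def __init__(self, width=2):
        self.successors = defaultdict(Counter)  # miss addr -> following misses
        self.last_miss = None
        self.width = width                      # predictions per miss

    def on_miss(self, addr):
        """Record a miss; return addresses that most often followed it before."""
        if self.last_miss is not None:
            self.successors[self.last_miss][addr] += 1
        self.last_miss = addr
        return [a for a, _ in self.successors[addr].most_common(self.width)]


m = MarkovPrefetcher()
for a in (1, 2, 3, 1, 2, 4, 1):
    predicted = m.on_miss(a)
print(predicted)
```
&lt;br /&gt;
After this miss sequence, a miss on address 1 predicts address 2, because 2 followed 1 both times it was seen earlier.&lt;br /&gt;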
&lt;br /&gt;
==='''Stealth Prefetching'''===&lt;br /&gt;
&lt;br /&gt;
A more modern and innovative technique, stealth prefetching&amp;lt;ref&amp;gt;http://arnetminer.org/publication/stealth-prefetching-53889.html&amp;lt;/ref&amp;gt;, attempts to deal with issues surrounding the increase of interconnection bandwidth and increased memory latency.  It also helps avoid the problem of using bandwidth to prematurely access shared data that will later result in state downgrades.  Basically, this technique tracks regions of memory at a coarse granularity to identify which segments are heavily used by multiple processors and which ones are not.  Lines that are not being shared by multiple processors are more aggressively brought nearer to the local processor, thus improving performance by as much as 20%.  With the increased use of fast multiprocessors, this alleviates some of the bandwidth bottlenecks.&lt;br /&gt;
&lt;br /&gt;
=='''Definitions'''==&lt;br /&gt;
&lt;br /&gt;
1) '''Linked List''' - In computer science, a linked list is a data structure consisting of a group of nodes which together represent a sequence. In its simplest form, each node is composed of a datum and a reference (in other words, a link) to the next node in the sequence; more complex variants add additional links. This structure allows for efficient insertion or removal of elements from any position in the sequence.&lt;br /&gt;
&lt;br /&gt;
2) '''Cache Pollution''' - Cache pollution describes situations where an executing computer program loads data into CPU cache unnecessarily, thus causing other needed data to be evicted from the cache into lower levels of the memory hierarchy, potentially all the way down to main memory, thus causing a performance hit.&lt;br /&gt;
&lt;br /&gt;
3) '''SPARC''' - SPARC (from Scalable Processor Architecture) is a RISC instruction set architecture (ISA) developed by Sun Microsystems and introduced in mid-1987.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==See Also==&lt;br /&gt;
http://cdmetcalf.home.comcast.net/~cdmetcalf/papers/prefetch/node6.html&lt;br /&gt;
&lt;br /&gt;
http://www.multicoreinfo.com/prefetching-multicore-processors/&lt;br /&gt;
&lt;br /&gt;
http://www.cs.cmu.edu/~tcm/thesis/subsubsection2_10_1_4_1.html&lt;br /&gt;
&lt;br /&gt;
http://software.intel.com/en-us/articles/how-to-choose-between-hardware-and-software-prefetch-on-32-bit-intel-architecture/&lt;br /&gt;
&lt;br /&gt;
http://www.futurechips.org/chip-design-for-all/prefetching.html&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74504</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74504"/>
		<updated>2013-04-02T00:39:10Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative and had LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache size is not sufficient to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; are misses that occur as a result of invalidation to preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by employing prefetching techniques, such as increasing the size of the cache lines or prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. Prefetching approaches proposed in the literature are software or hardware based.&lt;br /&gt;
[[Image:Prefetching.png]]&lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
1. Software Prefetching&lt;br /&gt;
2. Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software and Hardware Prefetching===&lt;br /&gt;
&lt;br /&gt;
====Software Prefetching====&lt;br /&gt;
With software prefetching the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive.&lt;br /&gt;
&lt;br /&gt;
A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch.&lt;br /&gt;
&lt;br /&gt;
If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line may already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction but also introduces additional cost, since the cache line is now fetched twice from main memory or the next cache level, which increases the memory bandwidth requirement of the program.&lt;br /&gt;
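&lt;br /&gt;
The timing trade-off described above can be sketched with a common rule of thumb for loops: divide the expected miss latency by the time one loop iteration takes and round up. The function name and the cycle counts below are illustrative assumptions, not measurements from any particular processor.&lt;br /&gt;

```python
import math


def prefetch_distance(miss_latency_cycles, cycles_per_iteration):
    """Return how many loop iterations ahead a prefetch should be issued.

    Rule of thumb (an assumption of this sketch): the prefetch must be
    issued at least miss_latency_cycles before the data is used, so the
    distance is the latency divided by the work per iteration, rounded up.
    """
    if not (miss_latency_cycles > 0 and cycles_per_iteration > 0):
        raise ValueError("cycle counts must be positive")
    return math.ceil(miss_latency_cycles / cycles_per_iteration)


# Hypothetical numbers: a 200-cycle miss and 25 cycles of work per iteration.
distance = prefetch_distance(200, 25)
print(distance)
```

A distance much smaller than this estimate leaves the consuming instruction stalled on an incomplete prefetch, while a much larger one raises the risk that the line is evicted before it is used.&lt;br /&gt;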
&lt;br /&gt;
====Hardware Prefetching====&lt;br /&gt;
Many modern processors implement hardware prefetching: the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines is accessed by the program. When such a stream is found, the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stride&amp;lt;/i&amp;gt; prefetcher looks for instructions that make accesses with regular strides, which do not necessarily have to be to consecutive cache lines. When such an instruction is detected, the processor tries to prefetch the cache lines that the instruction will access next.&lt;br /&gt;
&lt;br /&gt;
An &amp;lt;i&amp;gt;adjacent cache line prefetcher&amp;lt;/i&amp;gt; automatically fetches the cache lines adjacent to the ones being accessed by the program. This can be used to mimic the behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
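&lt;br /&gt;
As a rough illustration of stride detection, the Python sketch below keeps one table entry per load instruction and predicts the next address once the same stride has been seen twice in a row. The class name, the table layout, and the two-in-a-row confirmation rule are assumptions of this sketch and do not describe any specific processor.&lt;br /&gt;

```python
class StridePrefetcher:
    """Toy per-instruction stride detector (illustrative, not a real design)."""

    def __init__(self):
        # Maps an instruction address (pc) to (last_address, last_stride).
        self.table = {}

    def access(self, pc, address):
        """Record an access; return the predicted next address or None."""
        prediction = None
        if pc in self.table:
            last_address, last_stride = self.table[pc]
            stride = address - last_address
            # Predict only after the same nonzero stride is seen twice in a row.
            if stride != 0 and stride == last_stride:
                prediction = address + stride
            self.table[pc] = (address, stride)
        else:
            self.table[pc] = (address, 0)
        return prediction


prefetcher = StridePrefetcher()
# Hypothetical load at pc 0x400 walking an array with a 256-byte stride.
predictions = [prefetcher.access(0x400, a) for a in (1000, 1256, 1512, 1768)]
print(predictions)
```

A stream prefetcher can be viewed as the special case in which the detected stride equals one cache line.&lt;br /&gt;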
&lt;br /&gt;
====Software vs. Hardware Prefetching====&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
3. Incurs a start-up penalty before the hardware prefetcher triggers and issues extra fetches after the end of an array.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4 KB page boundary (i.e., the program has to initiate demand loads for the new page before the hardware prefetcher starts prefetching from that page).&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74503</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74503"/>
		<updated>2013-04-02T00:37:50Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache is too small to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; occur as a result of invalidations that preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by increasing the size of the cache lines or by prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. Prefetching approaches proposed in the literature are software or hardware based.&lt;br /&gt;
[[Image:Prefetching.png]]&lt;br /&gt;
&lt;br /&gt;
Prefetching can be achieved in two ways:&lt;br /&gt;
1. Software Prefetching&lt;br /&gt;
2. Hardware Prefetching&lt;br /&gt;
&lt;br /&gt;
===Software Prefetching===&lt;br /&gt;
With software prefetching the programmer or compiler inserts prefetch instructions into the program. These are instructions that initiate a load of a cache line into the cache, but do not stall waiting for the data to arrive.&lt;br /&gt;
&lt;br /&gt;
A critical property of prefetch instructions is the time from when the prefetch is executed to when the data is used. If the prefetch is too close to the instruction using the prefetched data, the cache line will not have had time to arrive from main memory or the next cache level and the instruction will stall. This reduces the effectiveness of the prefetch.&lt;br /&gt;
&lt;br /&gt;
If the prefetch is too far ahead of the instruction using the prefetched data, the prefetched cache line may already have been evicted before the data is actually used. The instruction using the data will then cause another fetch of the cache line and have to stall. This not only eliminates the benefit of the prefetch instruction but also introduces additional cost, since the cache line is now fetched twice from main memory or the next cache level, which increases the memory bandwidth requirement of the program.&lt;br /&gt;
&lt;br /&gt;
===Hardware Prefetching===&lt;br /&gt;
Many modern processors implement hardware prefetching: the processor monitors the memory access pattern of the running program, tries to predict what data the program will access next, and prefetches that data. There are a few different variants of how this can be done.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stream&amp;lt;/i&amp;gt; prefetcher looks for streams where a sequence of consecutive cache lines is accessed by the program. When such a stream is found, the processor starts prefetching the cache lines ahead of the program's accesses.&lt;br /&gt;
&lt;br /&gt;
A &amp;lt;i&amp;gt;stride&amp;lt;/i&amp;gt; prefetcher looks for instructions that make accesses with regular strides, which do not necessarily have to be to consecutive cache lines. When such an instruction is detected, the processor tries to prefetch the cache lines that the instruction will access next.&lt;br /&gt;
&lt;br /&gt;
An &amp;lt;i&amp;gt;adjacent cache line prefetcher&amp;lt;/i&amp;gt; automatically fetches the cache lines adjacent to the ones being accessed by the program. This can be used to mimic the behaviour of a larger cache line size in a cache level without actually having to increase the line size.&lt;br /&gt;
&lt;br /&gt;
===Software vs. Hardware Prefetching===&lt;br /&gt;
Software prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Can handle irregular access patterns, which do not trigger the hardware prefetcher.&lt;br /&gt;
&lt;br /&gt;
2. Can use less bus bandwidth than hardware prefetching.&lt;br /&gt;
&lt;br /&gt;
3. Software prefetches must be added to new code, and they do not benefit existing applications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Hardware prefetching has the following characteristics:&lt;br /&gt;
&lt;br /&gt;
1. Works with existing applications&lt;br /&gt;
&lt;br /&gt;
2. Requires regular access patterns&lt;br /&gt;
&lt;br /&gt;
3. Incurs a start-up penalty before the hardware prefetcher triggers and issues extra fetches after the end of an array.&lt;br /&gt;
&lt;br /&gt;
4. Will not prefetch across a 4 KB page boundary (i.e., the program has to initiate demand loads for the new page before the hardware prefetcher starts prefetching from that page).&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74495</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74495"/>
		<updated>2013-04-02T00:29:49Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache is too small to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; occur as a result of invalidations that preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by increasing the size of the cache lines or by prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. Prefetching approaches proposed in the literature are software or hardware based.&lt;br /&gt;
[[Image:Prefetching.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Pf.png&amp;diff=74494</id>
		<title>File:Pf.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Pf.png&amp;diff=74494"/>
		<updated>2013-04-02T00:28:40Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Prefetch.png&amp;diff=74491</id>
		<title>File:Prefetch.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Prefetch.png&amp;diff=74491"/>
		<updated>2013-04-02T00:24:44Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74490</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74490"/>
		<updated>2013-04-02T00:19:52Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;='''Prefetching and Memory Consistency Models'''=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== '''Cache Misses'''==&lt;br /&gt;
Cache misses can be classified into four categories: conflict, compulsory, capacity [3], and coherence. &amp;lt;b&amp;gt;Conflict misses&amp;lt;/b&amp;gt; are misses that would not occur if the cache were fully associative with LRU replacement. &amp;lt;b&amp;gt;Compulsory misses&amp;lt;/b&amp;gt; are misses required in any cache organization because they are the first references to an instruction or piece of data. &amp;lt;b&amp;gt;Capacity misses&amp;lt;/b&amp;gt; occur when the cache is too small to hold data between references. &amp;lt;b&amp;gt;Coherence misses&amp;lt;/b&amp;gt; occur as a result of invalidations that preserve multiprocessor cache consistency. The number of compulsory and capacity misses can be reduced by increasing the size of the cache lines or by prefetching blocks ahead of time.&lt;br /&gt;
&lt;br /&gt;
== '''Prefetching'''==&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty. Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss. Prefetching approaches proposed in the literature are software or hardware based.&lt;br /&gt;
[[File:Prefetching.png]]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Prefetching.png&amp;diff=74489</id>
		<title>File:Prefetching.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Prefetching.png&amp;diff=74489"/>
		<updated>2013-04-02T00:18:56Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: uploaded a new version of &amp;amp;quot;File:Prefetching.png&amp;amp;quot;: Reverted to version as of 00:17, 2 April 2013&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Illustrates how Prefetching can reduce memory latency&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Prefetching.png&amp;diff=74488</id>
		<title>File:Prefetching.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Prefetching.png&amp;diff=74488"/>
		<updated>2013-04-02T00:18:40Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: uploaded a new version of &amp;amp;quot;File:Prefetching.png&amp;amp;quot;: Illustrates how Prefetching can reduce memory latency&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Illustrates how Prefetching can reduce memory latency&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Prefetching.png&amp;diff=74487</id>
		<title>File:Prefetching.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Prefetching.png&amp;diff=74487"/>
		<updated>2013-04-02T00:17:33Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: Illustrates how Prefetching can reduce memory latency&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Illustrates how Prefetching can reduce memory latency&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74484</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74484"/>
		<updated>2013-04-01T23:46:54Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
== '''Overview''' ==&lt;br /&gt;
This wiki article explores two topics: sequential prefetching and memory consistency models. It first describes prefetching and explains several types, such as fixed and adaptive sequential prefetching, in detail. It then covers memory consistency models, such as sequential consistency and relaxed consistency, and presents authors' and researchers' comments through examples.&lt;br /&gt;
&lt;br /&gt;
= '''Prefetching''' =&lt;br /&gt;
&lt;br /&gt;
Sequential prefetching is a simple hardware-controlled prefetching technique that relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality.  In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution&amp;lt;ref&amp;gt;http://129.16.20.23/~pers/pub/j5.pdf&amp;lt;/ref&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty.  Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss.  Prefetching approaches proposed in the literature are software or hardware based.  &lt;br /&gt;
&lt;br /&gt;
Software-controlled prefetching schemes rely on the programmer or compiler to insert prefetch instructions prior to the instructions that trigger a miss.  In addition, both the processor and the memory system must support prefetch instructions, which can increase the code size and the run-time overhead.  By contrast, hardware-controlled prefetching relieves the programmer or compiler from the burden of deciding what and when to prefetch.  Usually, these schemes take advantage of the regularity of data access in scientific computations by dynamically detecting access strides.&lt;br /&gt;
&lt;br /&gt;
This article considers two sequential prefetching techniques:&lt;br /&gt;
&lt;br /&gt;
1. Fixed sequential prefetching&lt;br /&gt;
&lt;br /&gt;
2. Adaptive sequential prefetching&lt;br /&gt;
&lt;br /&gt;
These are discussed in the following sections, along with other prefetching models.&lt;br /&gt;
&lt;br /&gt;
== Simulated processing Node Architecture ==&lt;br /&gt;
According to Fig. 2.1, the processing node consists of a processor, a first-level cache (FLC), a second-level cache (SLC), a first- and second-level write buffer (FLWB and SLWB), a local bus, a network interface controller, and a memory module. The FLC is a direct-mapped write-through cache with no allocation of blocks on write misses and is blocking on read misses. Writes and read miss requests are buffered in the FLWB. The second-level cache (SLC) is a direct-mapped write-back cache. For both prefetching techniques, we only prefetch into the SLC.  In addition, a second-level write buffer (SLWB) keeps track of outstanding requests (SLC read miss, prefetch, and write requests). No more than one request to the same block is allowed to be issued to the system; others are just kept in the SLWB while waiting for the pending request to that block to complete. Moreover, a read miss request may bypass write requests if they are for different blocks.&lt;br /&gt;
[[File:Procenv.png|thumb|center|upright|350px|Figure 2.1. Processor environment and simulated architecture]]&lt;br /&gt;
&lt;br /&gt;
==Fixed Sequential Prefetching&amp;lt;ref&amp;gt;http://www.springerlink.com/content/lu0755310187318n/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
By fixed sequential prefetching we mean that K consecutive blocks are prefetched into the SLC on a reference to a block, i.e., blocks n + 1 ... n + K are prefetched upon a reference to block n, if they are not present in the cache. Sequential prefetching has been extensively studied in the context of uniprocessors but, to our knowledge, has never been considered for general applications on multiprocessors. Although many sequential strategies have been proposed for uniprocessors, we have restricted ourselves to prefetching on a miss in the SLC. When a reference misses in the SLC, the miss request is sent to memory, and the cache is searched for the K consecutive blocks directly following the missing block in the address space. The blocks among the K consecutive blocks that are not present in the SLC and have no pending requests in the SLWB are prefetched. We refer to K as the degree of prefetching.&lt;br /&gt;
[[File:untitled1.png|thumb|center|upright|350px|Figure 2.2. The fixed sequential prefetching mechanism]]&lt;br /&gt;
&lt;br /&gt;
Fig. 2.2 shows the mechanism of the fixed sequential prefetching scheme. As a cache lookup is made for block address n, the next block address (n + 1) is calculated. On a read miss, a read request is issued to the memory system and is kept in the SLWB. In the next cache cycle, the calculated address (n + 1) is directed to the cache, and a cache lookup is made. If the block is not present in the cache, a prefetch request is issued and is kept in the SLWB. During that time, the subsequent block address (n + 2) is calculated. The number of iterations is determined by the degree of prefetching. The processor is blocked only during the time it takes to handle the first read miss. Since the prefetch requests are issued one at a time and are pipelined in the memory system, they can be overlapped with the original read request. Besides the simple extensions in the SLC to incorporate fixed sequential prefetching, the memory system must be able to handle three new network commands: a prefetch request and two reply messages denoted PreData and PreNeg. Whereas PreData carries the prefetched block, PreNeg tells the cache that the prefetch request cannot be satisfied because the memory copy is in a transient state, i.e., some other cache is reading or writing to it.&lt;br /&gt;
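&lt;br /&gt;
The prefetch-on-miss behaviour described above can be approximated by the small Python simulation below. It is only a sketch under simplifying assumptions that are not part of the scheme itself: the cache is unbounded (no evictions), prefetched blocks always arrive before they are next referenced, and the SLWB and PreNeg replies are not modelled. The function name is hypothetical.&lt;br /&gt;

```python
def fixed_sequential_prefetch(references, degree):
    """Simulate prefetch-on-miss with a fixed degree K on block addresses.

    Assumptions of this sketch: the cache is unbounded (no evictions) and
    prefetched blocks arrive before they are next referenced.
    Returns (misses, prefetched_blocks).
    """
    cache = set()
    misses = 0
    prefetched = []
    for n in references:
        if n in cache:
            continue  # hit: the scheme described here prefetches only on a miss
        misses += 1
        cache.add(n)
        # Look up blocks n+1 .. n+K and prefetch those that are absent.
        for k in range(1, degree + 1):
            if n + k not in cache:
                cache.add(n + k)
                prefetched.append(n + k)
    return misses, prefetched


# Hypothetical trace: eight consecutive blocks, degree of prefetching K = 3.
misses, prefetched = fixed_sequential_prefetch(range(8), 3)
print(misses, prefetched)
```

With a purely sequential trace, only every (K + 1)-th block misses in this model; with K = 0 every block misses, which gives a baseline for comparison.&lt;br /&gt;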
&lt;br /&gt;
==Adaptive Sequential Prefetching&amp;lt;ref&amp;gt;http://web.cecs.pdx.edu/~walpole/papers/mmcn1998b.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
The mechanism behind the adaptive scheme is basically the same as that of fixed sequential prefetching: prefetching is activated by a read miss and blocks are prefetched into the SLC. In contrast to fixed sequential prefetching, however, the degree of prefetching is not fixed; rather, it is controlled by a register, the LookaheadCounter. The adaptive sequential prefetching scheme relies on adjusting the degree of prefetching (the value of the LookaheadCounter) dynamically by counting the useful prefetches, i.e., prefetched blocks that are actually referenced during their lifetime in the cache. To explain how this is achieved, we will first focus on how the algorithm measures the prefetch efficiency and then on how the LookaheadCounter is adjusted in response to that efficiency. The mechanisms needed to achieve these tasks, two bits per cache line and three counters per cache, appear in Table 1.&lt;br /&gt;
&lt;br /&gt;
{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|PrefetchBit (per Cache Line)&lt;br /&gt;
|Used to detect useful prefetches (needed when prefetching is turned on)&lt;br /&gt;
|-&lt;br /&gt;
|ZeroBit (per cache line)&lt;br /&gt;
|Used to detect when a prefetch would have been useful (needed when prefetching is turned off)&lt;br /&gt;
|-&lt;br /&gt;
|LookaheadCounter (per cache)&lt;br /&gt;
|The current degree of prefetching&lt;br /&gt;
|-&lt;br /&gt;
|PrefetchCounter (per cache)&lt;br /&gt;
|Counts the number of prefetches that have been returned after each read miss&lt;br /&gt;
|-&lt;br /&gt;
|UsefulCounter (per cache)&lt;br /&gt;
|Counts the number of useful prefetches&lt;br /&gt;
|}&lt;br /&gt;
Conceptually, the algorithm measures the prefetch efficiency by counting the fraction of prefetched blocks that are referenced by the processors. If this fraction exceeds a preset threshold, the degree of prefetching is increased and, if it is below another preset threshold, the degree of prefetching is decreased.&lt;br /&gt;
&lt;br /&gt;
The basic mechanisms used to measure the prefetch efficiency consist of two counters (the PrefetchCounter and the UsefulCounter) and a PrefetchBit per cache line, all of which are initially cleared. The fraction of useful prefetches is established as the ratio of the UsefulCounter to the PrefetchCounter as follows. The number of prefetched blocks is counted by incrementing the PrefetchCounter whenever a prefetch acknowledgment is received from the memory system, independent of whether the memory system accepted the prefetch (PreData) or rejected it (PreNeg), e.g., because the memory block was in a transient state and neither clean nor dirty. To count the number of prefetched blocks that are referenced, the PrefetchBit of a prefetched block is set; when a block is accessed with its PrefetchBit set, the UsefulCounter is incremented and the PrefetchBit is cleared.&lt;br /&gt;
&lt;br /&gt;
Every time the PrefetchCounter reaches its maximum (i.e., it wraps around), the value of the UsefulCounter is compared against two preset thresholds to determine whether the LookaheadCounter, initially set to one, should be changed. If the UsefulCounter exceeds the upper threshold, we are in a phase of execution where the program could benefit from a higher degree of prefetching and therefore the LookaheadCounter is incremented. If the UsefulCounter is lower than the lower threshold, the amount of prefetching is too high and the LookaheadCounter is decremented. Finally, if the UsefulCounter has a value between the two thresholds, the LookaheadCounter is not affected. In all cases, the UsefulCounter is cleared. In our evaluation, we have considered counters modulo 16 (4 bits).&lt;br /&gt;
&lt;br /&gt;
When the LookaheadCounter reaches zero, prefetching is turned off. To turn it back on, we use the following mechanism. When a block is received on a read miss and prefetching is turned off, the ZeroBit in the corresponding SLC block frame, which is initially cleared, is set to indicate that the following block in the address space could have been prefetched, and the PrefetchCounter is incremented. On a read miss, a cache lookup is made to the previous block (by address); if it hits and the ZeroBit is set, the UsefulCounter is incremented and the ZeroBit is cleared. The ZeroBit of a block is also cleared when the block is accessed and the LookaheadCounter is not zero to keep the number of ZeroBits that have been previously set to a minimum.&lt;br /&gt;
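&lt;br /&gt;
The counter-based adjustment described above can be sketched in Python as follows. The class and method names are hypothetical, and the two threshold values are assumptions chosen only for illustration, since the text does not state them; the per-line PrefetchBit and ZeroBit bookkeeping is reduced to a single "useful prefetch" event, and no upper limit on the LookaheadCounter is modelled.&lt;br /&gt;

```python
COUNTER_MODULUS = 16   # 4-bit counters, as in the evaluation described above
UPPER_THRESHOLD = 12   # assumed value; the text does not give the thresholds
LOWER_THRESHOLD = 4    # assumed value; the text does not give the thresholds


class AdaptiveDegree:
    """Tracks the LookaheadCounter from prefetch and useful-prefetch events."""

    def __init__(self):
        self.lookahead = 1   # LookaheadCounter is initially one
        self.prefetches = 0  # PrefetchCounter
        self.useful = 0      # UsefulCounter

    def on_useful_prefetch(self):
        """A block with its PrefetchBit (or ZeroBit) set was referenced."""
        self.useful += 1

    def on_prefetch_reply(self):
        """A PreData or PreNeg reply arrived; adjust when the counter wraps."""
        self.prefetches = (self.prefetches + 1) % COUNTER_MODULUS
        if self.prefetches != 0:
            return
        if self.useful > UPPER_THRESHOLD:
            self.lookahead += 1
        elif LOWER_THRESHOLD > self.useful and self.lookahead > 0:
            self.lookahead -= 1
        self.useful = 0  # cleared in all cases


state = AdaptiveDegree()
# Hypothetical phase: 16 prefetch replies, 14 of the blocks get referenced.
for i in range(16):
    if i % 8 != 7:
        state.on_useful_prefetch()
    state.on_prefetch_reply()
print(state.lookahead)
```

In this sketch a phase with many useful prefetches raises the degree of prefetching, and a following phase with none lowers it again, eventually reaching zero, at which point the ZeroBit mechanism would be responsible for turning prefetching back on.&lt;br /&gt;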
&lt;br /&gt;
==Chip Multiprocessing Prefetching (CMP)==&lt;br /&gt;
[[File:PrefetchImp.png|thumb|right|upright|300px|Figure 2.3. Prefetch Implementation]]&lt;br /&gt;
Prefetching the lowest miss address stream in the cache hierarchy has many advantages, particularly in a CMP system. First, in a CMP, the L2 cache is often shared by all processors on the chip. Consequently, prefetching the L2 miss address stream can share prefetch history among the processors, resulting in larger history tables. Second, prefetching L2 miss addresses reduces contention on the cache ports, which is becoming increasingly important as the number of processors per chip grows. Before a prefetch is sent to the memory subsystem, it must access the L2 directory. Since the L2 miss address stream has the fewest memory references, it will generate fewer prefetches and access the cache ports less often. Last, prefetching into the L1 is relatively insignificant, since modern out-of-order processors can tolerate most L1 data cache misses with relatively little performance degradation. Prefetching in a CMP is more difficult than in a uniprocessor system. In addition to limited bandwidth and increased latency (as described earlier), cache coherency protocols play an important role in CMP prefetching.&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of Prefetching==&lt;br /&gt;
&lt;br /&gt;
1. Increased complexity and overhead of handling the prefetching algorithms.  Performance must improve significantly to counterbalance this overhead and complexity, or the effort is not worthwhile.&lt;br /&gt;
&lt;br /&gt;
2. With multiple cores, prefetch requests can originate from a variety of different cores. The memory system must then handle prefetch requests from different sources in addition to regular requests, which increases the overhead and complexity of the logic. &lt;br /&gt;
&lt;br /&gt;
3. If prefetched data is stored in the data cache, then cache conflicts, or cache pollution, can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74483</id>
		<title>CSC/ECE 506 Spring 2013/10a os</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/10a_os&amp;diff=74483"/>
		<updated>2013-04-01T23:46:17Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: Created page with &amp;quot;'''Prefetching and Memory Consistency Models'''  == '''Overview''' == This wiki article explores two different topics Sequential Prefetching and Memory Consistency models.  The a...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Prefetching and Memory Consistency Models'''&lt;br /&gt;
&lt;br /&gt;
== '''Overview''' ==&lt;br /&gt;
This wiki article explores two topics: sequential prefetching and memory consistency models. It first describes prefetching and explains several types, such as fixed and adaptive sequential prefetching, in detail. It then covers memory consistency models, such as sequential consistency and relaxed consistency, and presents authors' and researchers' comments through examples.&lt;br /&gt;
&lt;br /&gt;
= '''Prefetching''' =&lt;br /&gt;
&lt;br /&gt;
Sequential prefetching is a simple hardware-controlled prefetching technique that relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality.  In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution&amp;lt;ref&amp;gt;http://129.16.20.23/~pers/pub/j5.pdf&amp;lt;/ref&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Prefetching is a common technique to reduce the read miss penalty.  Prefetching relies on predicting which blocks currently missing in the cache will be read in the future and on bringing these blocks into the cache prior to the reference triggering the miss.  Prefetching approaches proposed in the literature are software or hardware based.  &lt;br /&gt;
&lt;br /&gt;
Software-controlled prefetching schemes rely on the programmer or compiler to insert prefetch instructions prior to the instructions that trigger a miss.  In addition, both the processor and the memory system must support prefetch instructions, which can increase the code size and the run-time overhead.  By contrast, hardware-controlled prefetching relieves the programmer or compiler from the burden of deciding what and when to prefetch.  Usually, these schemes take advantage of the regularity of data access in scientific computations by dynamically detecting access strides.&lt;br /&gt;
&lt;br /&gt;
This article considers two sequential prefetching techniques:&lt;br /&gt;
&lt;br /&gt;
1. Fixed sequential prefetching&lt;br /&gt;
&lt;br /&gt;
2. Adaptive sequential prefetching&lt;br /&gt;
&lt;br /&gt;
These are discussed in the following sections, along with other prefetching models.&lt;br /&gt;
&lt;br /&gt;
== Simulated processing Node Architecture ==&lt;br /&gt;
According to Fig. 2.1, the processing node consists of a processor, a first-level cache (FLC), a second-level cache (SLC), a first- and second-level write buffer (FLWB and SLWB), a local bus, a network interface controller, and a memory module. The FLC is a direct-mapped write-through cache with no allocation of blocks on write misses and is blocking on read misses. Writes and read miss requests are buffered in the FLWB. The second-level cache (SLC) is a direct-mapped write-back cache. For both prefetching techniques, we only prefetch into the SLC.  In addition, a second-level write buffer (SLWB) keeps track of outstanding requests (SLC read miss, prefetch, and write requests). No more than one request to the same block is allowed to be issued to the system; others are just kept in the SLWB while waiting for the pending request to that block to complete. Moreover, a read miss request may bypass write requests if they are for different blocks.&lt;br /&gt;
[[File:Procenv.png|thumb|center|upright|350px|Figure 2.1. Processor environment and simulated architecture]]&lt;br /&gt;
&lt;br /&gt;
==Fixed Sequential Prefetching&amp;lt;ref&amp;gt;http://www.springerlink.com/content/lu0755310187318n/&amp;lt;/ref&amp;gt;==&lt;br /&gt;
By fixed sequential prefetching we mean that K consecutive blocks are prefetched into the SLC on a reference to a block, i.e., blocks n + 1 ... n + K are prefetched upon a reference to block n, if they are not present in the cache. Sequential prefetching has been extensively studied in the context of uniprocessors, but to our knowledge it has never been considered for general applications on multiprocessors. Although many sequential strategies have been proposed for uniprocessors, we have restricted ourselves to prefetching on a miss in the SLC. When a reference misses in the SLC, the miss request is sent to memory, and the cache is searched for the K consecutive blocks directly following the missing block in the address space. The blocks among the K consecutive blocks that are not present in the SLC and have no pending requests in the SLWB are prefetched. We refer to K as the degree of prefetching.&lt;br /&gt;
[[File:untitled1.png|thumb|center|upright|350px|Figure 2.2. The fixed sequential prefetching mechanism]]&lt;br /&gt;
&lt;br /&gt;
Fig. 2.2 shows the mechanism of the fixed sequential prefetching scheme. As a cache lookup is made for block address n, the next block address (n + 1) is calculated. On a read miss, a read request is issued to the memory system and is kept in the SLWB. In the next cache cycle, the calculated address (n + 1) is directed to the cache, and a cache lookup is made. If the block is not present in the cache, a prefetch request is issued and is kept in the SLWB. During that time, the subsequent block address (n + 2) is calculated. The number of iterations is determined by the degree of prefetching. The processor is blocked only during the time it takes to handle the first read miss. Since the prefetch requests are issued one at a time and are pipelined in the memory system, they can be overlapped with the original read request. Besides the simple extensions in the SLC to incorporate fixed sequential prefetching, the memory system must be able to handle three new network commands: a prefetch request and two reply messages denoted PreData and PreNeg. Whereas PreData carries the prefetched block, PreNeg tells the cache that the prefetch request cannot be satisfied because the memory copy is in a transient state, i.e., some other cache is reading or writing to it.&lt;br /&gt;
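&lt;br /&gt;
The lookup loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from the cited work: the function names, the use of sets to stand in for the SLC contents and the SLWB, and the example block numbers are all hypothetical.&lt;br /&gt;

```python
# Hypothetical sketch of fixed sequential prefetching on an SLC read miss.
# All names are illustrative assumptions, not taken from the cited paper.


def blocks_to_prefetch(n, degree_k, slc_blocks, pending_in_slwb):
    """Return the blocks among n+1 .. n+K that should be prefetched.

    A block is skipped if it is already present in the SLC or already has
    a pending request recorded in the SLWB.
    """
    candidates = range(n + 1, n + degree_k + 1)
    return [b for b in candidates
            if b not in slc_blocks and b not in pending_in_slwb]


def handle_read_miss(n, degree_k, slc_blocks, pending_in_slwb):
    """Issue the demand read for block n, then queue the prefetches."""
    issued = [("read", n)]
    pending_in_slwb.add(n)
    for b in blocks_to_prefetch(n, degree_k, slc_blocks, pending_in_slwb):
        issued.append(("prefetch", b))
        pending_in_slwb.add(b)
    return issued


slc = {11}     # block 11 is already cached
slwb = {13}    # block 13 already has a request in flight
requests = handle_read_miss(10, 4, slc, slwb)
print(requests)
```

With a degree of prefetching K = 4 and a miss on block 10, only blocks 12 and 14 are requested, because block 11 is already cached and block 13 already has a pending request.&lt;br /&gt;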
&lt;br /&gt;
==Adaptive Sequential Prefetching&amp;lt;ref&amp;gt;http://web.cecs.pdx.edu/~walpole/papers/mmcn1998b.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
The mechanism behind the adaptive scheme is basically the same as that of fixed sequential prefetching. For example, prefetching is activated by a read miss and blocks are prefetched into the SLC. In contrast to fixed sequential prefetching, however, the degree of prefetching is not fixed; rather, it is controlled by a register, the LookaheadCounter. The adaptive sequential prefetching scheme relies on adjusting the degree of prefetching (the value of the LookaheadCounter) dynamically by counting the useful prefetches, i.e., prefetched blocks that are actually referenced during their lifetime in the cache. To explain how this is achieved, we will first focus on how the algorithm measures the prefetch efficiency and then on how the LookaheadCounter is adjusted to a certain prefetch efficiency. The mechanisms needed to achieve these tasks (two bits per cache line and three counters per cache) appear in Table 1.&lt;br /&gt;
&lt;br /&gt;
{|class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|PrefetchBit (per Cache Line)&lt;br /&gt;
|Used to detect useful prefetches (needed when prefetching is turned on.)&lt;br /&gt;
|-&lt;br /&gt;
|ZeroBit (per cache line)&lt;br /&gt;
|Used to detect when a prefetch would have been useful (needed when prefetching is turned off.)&lt;br /&gt;
|-&lt;br /&gt;
|LookaheadCounter (per cache)&lt;br /&gt;
|The current degree of prefetching (per cache)&lt;br /&gt;
|-&lt;br /&gt;
|PrefetchCounter (per cache)&lt;br /&gt;
|Counts the number of prefetches that have been returned after each read miss&lt;br /&gt;
|-&lt;br /&gt;
|UsefulCounter (per cache)&lt;br /&gt;
|Counts the number of useful prefetches&lt;br /&gt;
|}&lt;br /&gt;
Conceptually, the algorithm measures the prefetch efficiency by counting the fraction of prefetched blocks that are referenced by the processors. If this fraction exceeds a preset threshold, the degree of prefetching is increased and, if it is below another preset threshold, the degree of prefetching is decreased.&lt;br /&gt;
&lt;br /&gt;
The basic mechanisms used to measure the prefetch efficiency consist of two counters (the PrefetchCounter and the UsefulCounter) and a PrefetchBit per cache line, all of which are initially cleared. The fraction of useful prefetches is the ratio of the UsefulCounter to the PrefetchCounter. The number of prefetched blocks is counted by incrementing the PrefetchCounter whenever a prefetch acknowledgment is received from the memory system, regardless of whether the prefetch was satisfied (PreData) or rejected (PreNeg), e.g., because the memory block was in a transient state, neither clean nor dirty. To count the number of prefetched blocks that are referenced, the PrefetchBit of a block is set when it is prefetched; when a block is accessed with its PrefetchBit set, the UsefulCounter is incremented and the PrefetchBit is cleared.&lt;br /&gt;
&lt;br /&gt;
Every time the PrefetchCounter reaches its maximum (i.e., it wraps around), the value of the UsefulCounter is compared against two preset thresholds to determine whether the LookaheadCounter (initially set to one) should be changed. If the UsefulCounter exceeds the upper threshold, we are in a phase of execution where the program could benefit from a higher degree of prefetching, and therefore the LookaheadCounter is incremented. If the UsefulCounter is lower than the lower threshold, the amount of prefetching is too high and the LookaheadCounter is decremented. Finally, if the UsefulCounter has a value between the two thresholds, the LookaheadCounter is not affected. In all cases, the UsefulCounter is cleared. In our evaluation, we have considered counters modulo 16 (4 bits).&lt;br /&gt;
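&lt;br /&gt;
The threshold logic above can be sketched as follows. This is only an illustrative model: the threshold values, the cap on the degree of prefetching, the saturating UsefulCounter, and all class and method names are assumptions; only the modulo-16 counter width and the initial LookaheadCounter value of one come from the text.&lt;br /&gt;

```python
# Hypothetical sketch of the LookaheadCounter adjustment.
COUNTER_MODULO = 16      # 4-bit counters, as stated in the text
UPPER_THRESHOLD = 12     # assumed value, for illustration only
LOWER_THRESHOLD = 8      # assumed value, for illustration only
MAX_DEGREE = 8           # assumed cap on the degree of prefetching


class AdaptivePrefetchState:
    def __init__(self):
        self.lookahead = 1        # LookaheadCounter, initially one
        self.prefetch_count = 0   # PrefetchCounter
        self.useful_count = 0     # UsefulCounter

    def on_useful_prefetch(self):
        """A block with its PrefetchBit set was referenced."""
        # Saturating at the 4-bit maximum is an assumption of this sketch.
        self.useful_count = min(self.useful_count + 1, COUNTER_MODULO - 1)

    def on_prefetch_reply(self):
        """A PreData or PreNeg reply arrived from the memory system."""
        self.prefetch_count = (self.prefetch_count + 1) % COUNTER_MODULO
        if self.prefetch_count == 0:          # the counter wrapped around
            self._adjust()

    def _adjust(self):
        if self.useful_count > UPPER_THRESHOLD:
            self.lookahead = min(self.lookahead + 1, MAX_DEGREE)
        elif LOWER_THRESHOLD > self.useful_count:
            self.lookahead = max(self.lookahead - 1, 0)
        self.useful_count = 0                 # cleared in all cases


state = AdaptivePrefetchState()
for i in range(16):
    if 14 > i:                    # 14 of the 16 prefetches turn out useful
        state.on_useful_prefetch()
    state.on_prefetch_reply()
print(state.lookahead)
```

In this run 14 useful prefetches exceed the assumed upper threshold of 12, so the degree of prefetching rises from one to two when the PrefetchCounter wraps.&lt;br /&gt;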
&lt;br /&gt;
When the LookaheadCounter reaches zero, prefetching is turned off. To turn it back on, we use the following mechanism. When a block is received on a read miss and prefetching is turned off, the ZeroBit in the corresponding SLC block frame, which is initially cleared, is set to indicate that the following block in the address space could have been prefetched, and the PrefetchCounter is incremented. On a read miss, a cache lookup is made for the previous block (by address); if it hits and the ZeroBit is set, the UsefulCounter is incremented and the ZeroBit is cleared. To keep the number of previously set ZeroBits to a minimum, the ZeroBit of a block is also cleared when the block is accessed while the LookaheadCounter is not zero.&lt;br /&gt;
&lt;br /&gt;
==Chip Multiprocessing Prefetching (CMP)==&lt;br /&gt;
[[File:PrefetchImp.png|thumb|right|upright|300px|Figure 2.3. Prefetch Implementation]]&lt;br /&gt;
Prefetching the lowest miss address stream in the cache hierarchy has many advantages, particularly in a CMP system. First, in a CMP, the L2 cache is often shared by all processors on the chip. Consequently, prefetching the L2 miss address stream can share prefetch history among the processors, resulting in larger history tables. Second, prefetching L2 miss addresses reduces contention on the cache ports, which is becoming increasingly important as the number of processors per chip grows. Before a prefetch is sent to the memory subsystem, it must access the L2 directory. Since the L2 miss address stream has the fewest memory references, it generates fewer prefetches and accesses the cache ports less often. Last, prefetching into the L1 is relatively insignificant, since modern out-of-order processors can tolerate most L1 data cache misses with relatively little performance degradation. Prefetching in a CMP is more difficult than in a uniprocessor system. In addition to limited bandwidth and increased latency (as described earlier), cache coherency protocols play an important role in CMP prefetching.&lt;br /&gt;
&lt;br /&gt;
==Disadvantages of Prefetching==&lt;br /&gt;
&lt;br /&gt;
1. Increased complexity and overhead of handling the prefetching algorithms.  Performance must improve significantly to counterbalance this overhead and complexity; otherwise the effort is not worthwhile.&lt;br /&gt;
&lt;br /&gt;
2. With multiple cores, prefetching requests can originate from a variety of different cores. This puts additional stress on memory to not only deal with regular prefetch requests but also to handle prefetch from different sources, thus greatly increasing the overhead and complexity of logic. &lt;br /&gt;
&lt;br /&gt;
3. If prefetched data is stored in the data cache, then cache conflicts, or cache pollution, can become a significant burden.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= External links =&lt;br /&gt;
1. [http://129.16.20.23/~pers/pub/j5.pdf Sequential hardware prefetching in shared-memory multiprocessors]&lt;br /&gt;
&lt;br /&gt;
2. [http://titanium.cs.berkeley.edu/papers/kamil-su-yelick-sc05.pdf Making Sequential Consistency Practical in Titanium]&lt;br /&gt;
&lt;br /&gt;
3. [http://static.usenix.org/event/usenix08/tech/full_papers/baek/baek.pdf Prefetching with Adaptive Cache Culling for Striped Disk Arrays]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=669044&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D669044 An adaptive network prefetch scheme]&lt;br /&gt;
&lt;br /&gt;
5. [http://www.csl.cornell.edu/courses/ee572/gharachorloo.icpp91.pdf Two Techniques to Enhance the Performance of Memory Consistency Models]&lt;br /&gt;
&lt;br /&gt;
6. [http://research.cs.wisc.edu/multifacet/papers/computer98_sccase_pdf.pdf Multiprocessors Should Support Simple Memory Consistency Models]&lt;br /&gt;
&lt;br /&gt;
7. [http://www.ee.ryerson.ca/~courses/ee8207/prefetchprj3.pdf Hardware Prefetching Schemes]&lt;br /&gt;
&lt;br /&gt;
8. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.1422&amp;amp;rep=rep1&amp;amp;type=pdf Sequential Prefetching in Shared Memory Processors ]&lt;br /&gt;
&lt;br /&gt;
9. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.1229&amp;amp;rep=rep1&amp;amp;type=pdf Software Controlled prefetching in Shared Memory Processors]&lt;br /&gt;
&lt;br /&gt;
10. [http://www.cs.cmu.edu/~tcm/thesis/subsection2_10_3_2.html Relaxed consistency models]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=74482</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=74482"/>
		<updated>2013-04-01T22:46:29Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring_2012/2a va ]]&lt;br /&gt;
*Chapter 2b [[CSC/ECE 506 Spring 2012/ch2b cm | CSC/ECE 506 Spring 2012/ch2b cm]]&lt;br /&gt;
*Chapter 2b [[ECE506_CSC/ECE_506_Spring_2012/2b_az | CSC/ECE 506 Spring 2012/2b az - Data-Parallel Processing with the AMD HD 6900 Series Graphics Processing Unit]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]&lt;br /&gt;
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms  ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]&lt;br /&gt;
*Chapter 4b [[Chapter 4b CSC/ECE 506 Spring 2011 / ch4b]]&lt;br /&gt;
*Chapter 5a [[ CSC/ECE 506 Spring 2012/ch5a ja | CSC/ECE 506 Spring 2012/ch5a ja ]]&lt;br /&gt;
*Chapter 9a [[CSC/ECE 506 Spring 2012/ch9a cm | CSC/ECE 506 Spring 2012/ch9a cm]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]&lt;br /&gt;
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]&lt;br /&gt;
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]&lt;br /&gt;
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]&lt;br /&gt;
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]&lt;br /&gt;
*Chapter 10 [[CSC/ECE 506 Spring 2012/ch10 sj | CSC/ECE 506 Spring 2012/ch10 sj]]&lt;br /&gt;
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]&lt;br /&gt;
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]&lt;br /&gt;
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 (Ready for Final Review) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 | Interconnection Network Topologies]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a ry]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c dm]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c cl]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a mw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3a yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/7b yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3b sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/4b rs]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/6b am]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/8a cj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a dr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a jp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/9a ms]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10b sr]]&lt;br /&gt;
*Chapter 11a [[ECE506_CSC/ECE_506_Spring_2012/11a_az | CSC/ECE 506 Spring 2012/11a az - Performance of DSM system]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/12b jh]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a fu]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/11a ht]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1b dj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1a sp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1d ks]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/2b so]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1c ad]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/3b xz]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_aj]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_ss]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/1a_ag]]&lt;br /&gt;
* Chapter 3a [[CSC/ECE_506_Spring_2013/3a_bs]]&lt;br /&gt;
* Chapter 6a [[CSC/ECE_506_Spring_2013/6a_cs]]&lt;br /&gt;
* Chapter 5a [[CSC/ECE_506_Spring_2013/5a_ks]]&lt;br /&gt;
* Chapter 8a [[CSC/ECE_506_Spring_2013/8a_an]]&lt;br /&gt;
* Chapter 7a [[CSC/ECE_506_Spring_2013/7a_bs]]&lt;br /&gt;
* Chapter 8b [[CSC/ECE_506_Spring_2013/8b_ap]]&lt;br /&gt;
* Chapter 8c [[CSC/ECE_506_Spring_2013/8c_da]]&lt;br /&gt;
* Chapter 10a [[CSC/ECE_506_Spring_2013/10a_os]]&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73603</id>
		<title>CSC/ECE 506 Spring 2013/6a cs</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73603"/>
		<updated>2013-02-24T08:01:37Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Snoopy Cache Coherence Schemes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|300px|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, the processor reads data and instructions from memory and operates on the data.  CPU operating frequencies have increased faster than the speed of memory and memory interconnects.  For example, cores in Intel's first-generation i7 processors run at 3.2 GHz, while the memory runs at only 1.3 GHz.  Multi-core architectures also place more demand on memory bandwidth.  This increases memory access latency and leaves the CPU idle for much of the time.  As a result, memory became a performance bottleneck.   &lt;br /&gt;
&lt;br /&gt;
To solve this problem, the “cache” was invented.  A cache is simply temporary volatile storage, like primary memory, but it runs at a speed close to the core frequency.  The CPU can access data and instructions from the cache in a few clock cycles, while accessing data from main memory can take more than 50 cycles.  In the early days of computing, the cache was implemented as a standalone chip outside the processor.  In today’s processors, the cache is implemented on the same die as the core.  &lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of cache, each successively farther from the core and larger in size.  L1 is closest to the CPU and, as a result, fastest to access.  Next to L1 is the L2 cache, and then L3.  The L1 cache is divided into an instruction cache and a data cache. This is better than having a combined larger cache because the instruction cache, being read-only, is easy to implement, while the data cache is read-write.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs so that memory access can keep pace with rapidly advancing CPU speeds.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16-64KB and provides access in 2-4 cycles.  L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles.  L3 typically ranges from 4MB to 32MB and provides access to the data it contains within 30-50 cycles.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management: cache coherency, replenishment, fetching, and other management functions must take into account all processes running concurrently.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors and others private to each processor.  Typically, lower-level caches are private and high-level caches may or may not be shared.   Notice that none of the examples below has a shared L1 cache.  Cache management functions must consider both shared and private caches when reading and writing data from memory.&lt;br /&gt;
[[Image:Uniprocessor_memory_hierarchy.jpg|thumbnail|200px|Uniprocessor memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[Image:Dualcore_memory_hierarchy.jpg|thumbnail|200px|Dualcore memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;|Table 1: Cache on different Microprocessors&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache Per core !! L2 cache !! L3 cache  !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 1MB 16way set assoc. || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64X2 || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 512KB/1MB 16way set assoc. || 2MB || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Barcelona || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Aug 2007&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems Ultra Sparc T2 || 8 ||  I:16KB   D:8KB || 4MB (8 banks) 16way set assoc. || - || Oct 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Wolfdale DP || 2 ||  D:96KB  || 6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Harpertown || 4 ||  D:96KB  || 2*6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Phenom || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Nov 2007    Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo || 2 ||  I:32KB   D:32KB  || 2/4MB 8 way set assoc. || - ||  2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Penryn Wolfdale DP || 4 ||  -  || 6-12MB || 6MB || Mar 2008     Aug 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Quad Yorkfield || 4 ||  D:96KB  || 12MB  || - ||  Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| AMD Toliman || 3K10 || I:64KB   D:64KB  || 512KB || 2MB Shared || Apr 2008&lt;br /&gt;
|-&lt;br /&gt;
| Azul Systems Vega3 7300 Series || 864 || 768GB  || - || - || May 2008&lt;br /&gt;
|-&lt;br /&gt;
| IBM RoadRunner || 8+1 || 32KB  || 512KB || - || Jun 2008&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
In section 6.2.3&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory. As a review, write-through writes data to both the cache and memory on a write. Write-back writes to the cache first and to memory only when a flush is required.&lt;br /&gt;
The write miss policies covered in the text&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
[[Image:Cache_policy_comparison.jpg|thumbnail|600px|Cache policy comparison&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Easy to implement&lt;br /&gt;
***Main memory has most recent copy of the data&lt;br /&gt;
***Read misses never result in writes to main memory&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Every write needs to access main memory&lt;br /&gt;
***Bandwidth intensive&lt;br /&gt;
***Writes are slower&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Writes are as fast as the speed of the cache memory&lt;br /&gt;
***Multiple writes to a block require one write to main memory&lt;br /&gt;
***Less bandwidth intensive&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Harder to implement&lt;br /&gt;
***Main memory may not be consistent with cache&lt;br /&gt;
***Reads that result in data replacement may cause dirty blocks to be written to main memory&lt;br /&gt;
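&lt;br /&gt;
The bandwidth trade-off between the two write hit policies can be illustrated with a small counting model. This is a sketch under simplifying assumptions that are not taken from the text: a two-line fully associative cache with LRU replacement and write-allocate, and a hypothetical function name and access trace.&lt;br /&gt;

```python
# Hypothetical sketch: count main-memory writes under the two write-hit
# policies. A fully associative LRU cache with write-allocate is assumed.
from collections import OrderedDict


def memory_writes(trace, policy, num_lines=2):
    """trace is a list of ("r" or "w", block) pairs; returns memory writes."""
    cache = OrderedDict()    # block -> dirty flag, ordered from LRU to MRU
    writes = 0
    for op, block in trace:
        dirty = cache.pop(block, False)   # keep the dirty flag on a hit
        if op == "w":
            if policy == "write-through":
                writes += 1          # every write also goes to memory
            else:
                dirty = True         # write-back: only mark the line dirty
        cache[block] = dirty         # (re)insert as most recently used
        if len(cache) > num_lines:
            _, victim_dirty = cache.popitem(last=False)
            if victim_dirty:
                writes += 1          # a dirty victim is written back
    return writes


trace = [("w", 0), ("w", 0), ("w", 0), ("r", 1), ("r", 2), ("r", 3)]
print(memory_writes(trace, "write-through"))
print(memory_writes(trace, "write-back"))
```

For this trace, write-through sends all three writes to memory, while write-back performs a single write when the dirty block is evicted by a later read.&lt;br /&gt;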
&lt;br /&gt;
==Write miss policies==&lt;br /&gt;
Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus. &lt;br /&gt;
&lt;br /&gt;
*Write-allocate vs Write no-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block.  For write-allocate, there will be a line in the cache for the written data.  This policy is typically associated with write-back caches.  For no-write-allocate, there will not be a line in the cache.&lt;br /&gt;
*Fetch-on-write vs no-fetch-on-write: Fetch-on-write causes the block of data to be fetched from a lower level of the memory hierarchy whenever a write misses. Fetch-on-write determines whether the block is fetched from memory, and write-allocate determines whether the memory block is stored in a cache line. Certain write policies may allocate a line in the cache for data that is written by the CPU without retrieving it from memory first.&lt;br /&gt;
*Write-before-hit vs no-write-before-hit:  The write-before-hit will write data to the cache before checking the cache tags for a match.  In case of a miss, the policy will displace the block of data already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram. They are considered unproductive since the data is fetched from lower-level memory but is not stored in the cache. Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low.&lt;br /&gt;
&lt;br /&gt;
* Write-validate: It is a combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run.  Write-validate requires that the lower level system memory can support writes of partial lines. The no-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the process is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-around:  Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme to write to cache.  It writes data to the next lower cache without modifying the data of the cache line. It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that in all but a few cases write-around performs worse than write-validate. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from lower-level memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst, which is consistent with the reductions quoted above. The author notes, though, that it still performs better than fetch-on-write and is easy to implement.&lt;br /&gt;
&lt;br /&gt;
* Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the base-line for determining the performance of the other three policy combinations.   In general, all three previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
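&lt;br /&gt;
The way the three write-miss attributes combine into the named policies above can be summarized as a small lookup. This is only a restatement of the combinations listed in this section; the function name and the "unproductive" label are illustrative choices.&lt;br /&gt;

```python
# Hypothetical helper that maps the three write-miss attributes described
# in this section onto the named policy combinations.


def classify_write_miss_policy(fetch_on_write, write_allocate,
                               write_before_hit):
    """Return the policy name for a combination of write-miss attributes."""
    if fetch_on_write:
        if write_allocate:
            return "fetch-on-write"
        # The block is fetched but never stored in the cache.
        return "unproductive"
    if write_allocate:
        return "write-validate"
    if write_before_hit:
        return "write-invalidate"
    return "write-around"


print(classify_write_miss_policy(False, True, False))
print(classify_write_miss_policy(False, False, True))
print(classify_write_miss_policy(False, False, False))
```

The two "unproductive" cases correspond to fetch-on-write with no-write-allocate, with or without write-before-hit.&lt;br /&gt;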
&lt;br /&gt;
=Prefetching=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. A cache is very efficient in terms of access time once the data or instructions are in the cache.  But when a process tries to access something that is not already in the cache, a cache miss occurs and the data needs to be brought into the cache from memory.  Generally, cache misses are expensive because the processor has to wait for the data (in parallel processing, a process can execute other tasks while it is waiting on data, but there will be some overhead for this).  Prefetching is a technique in which data is brought into the cache before the program needs it.  In other words, it is a way to reduce cache misses.  Prefetching uses some type of prediction mechanism to anticipate the next data that will be needed and brings it into the cache.  It is not guaranteed that the prefetched data will be used.  The goal is to reduce cache misses and so improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into the cache.  Programmers and compilers can insert these prefetch instructions in the code.  This is known as software prefetching.  In hardware prefetching, the processor observes the system behavior and issues prefetch requests.  The Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch.   Graphics Processing Units benefit from prefetching due to the spatial locality of their data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.   &lt;br /&gt;
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires a more complex architecture.  A second-order effect is the cost of implementing it in silicon and validating it.&lt;br /&gt;
* Software prefetching adds instructions to the program, making the program larger.&lt;br /&gt;
* If the same cache is used for prefetching, prefetched data can evict other cache blocks.  If the evicted blocks are needed again, they will generate cache misses.  This can be prevented by using a separate cache for prefetching, but that comes with hardware costs.&lt;br /&gt;
* When the scheduler changes the task running on a processor, prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked with the following metrics:&lt;br /&gt;
# Coverage is the fraction of original cache misses that were prefetched and therefore became cache hits.&lt;br /&gt;
# Accuracy is the fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetches arrive.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy and optimum timeliness.  Realistically, aggressive prefetching can increase coverage but decrease accuracy, and vice versa.  Also, if prefetching is done too early, the fetched data may be evicted before it is used, and if it is done too late, the access still results in a cache miss.&lt;br /&gt;
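&lt;br /&gt;
The coverage and accuracy definitions above can be illustrated with a small calculation. This is only a sketch: the function name and the counter values are hypothetical and do not come from any measured workload.&lt;br /&gt;

```python
# Hypothetical counters; names and values are illustrative only.
def prefetch_metrics(baseline_misses, useful_prefetches, issued_prefetches):
    """Return (coverage, accuracy) as fractions between 0 and 1."""
    # Coverage: share of the original misses that prefetching turned into hits.
    coverage = useful_prefetches / baseline_misses if baseline_misses else 0.0
    # Accuracy: share of issued prefetches that were actually used.
    accuracy = useful_prefetches / issued_prefetches if issued_prefetches else 0.0
    return coverage, accuracy

# A conservative prefetcher issues few prefetches, most of them useful.
conservative = prefetch_metrics(baseline_misses=1000, useful_prefetches=300, issued_prefetches=350)
# An aggressive prefetcher covers more misses but wastes many prefetches.
aggressive = prefetch_metrics(baseline_misses=1000, useful_prefetches=700, issued_prefetches=2000)
print("conservative:", conservative)
print("aggressive:", aggressive)
```

With these made-up numbers the aggressive prefetcher raises coverage from 0.3 to 0.7 while its accuracy falls to 0.35, which is the trade-off described above.&lt;br /&gt;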
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This prefetching technique uses a FIFO buffer in which each entry holds a cache line together with its address (or tag) and an available bit.  The system prefetches a stream of sequential data into a stream buffer, and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, the head entries of the stream buffers are checked for a match at the same time as the cache.  If the block is not found in the cache but is found at the head of a stream buffer, it is moved to the cache and the next entry in the buffer becomes the head.  If the block is found neither in the cache nor at the head of a buffer, the data is brought from memory into the cache and the subsequent addresses are assigned to a new stream buffer.  Only the heads of the stream buffers are checked during a cache access, not the whole buffer, because checking every entry in every buffer would increase hardware complexity.&lt;br /&gt;
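&lt;br /&gt;
The lookup sequence described above can be sketched as a toy model. The class name, the unbounded cache, and the choice to recycle the oldest buffer on a miss are simplifying assumptions made for illustration; real hardware also tracks the available bit and memory latency, which this sketch ignores.&lt;br /&gt;

```python
from collections import deque

class StreamBufferCache:
    """Toy model: an unbounded cache plus a few FIFO stream buffers of block addresses."""

    def __init__(self, num_buffers=2, depth=4):
        self.cache = set()          # block addresses currently cached
        self.num_buffers = num_buffers
        self.depth = depth          # entries per stream buffer (at least 1)
        self.buffers = deque()      # oldest stream buffer on the left

    def access(self, block):
        if block in self.cache:
            return "cache hit"
        # Only the head entry of each stream buffer is compared.
        for buf in self.buffers:
            if buf[0] == block:
                buf.append(buf[-1] + 1)   # prefetch the next sequential block
                buf.popleft()             # the next entry becomes the head
                self.cache.add(block)     # move the block into the cache
                return "stream buffer hit"
        # Miss everywhere: fetch from memory and start a new stream.
        self.cache.add(block)
        if len(self.buffers) == self.num_buffers:
            self.buffers.popleft()        # assumed policy: recycle the oldest buffer
        self.buffers.append(deque(range(block + 1, block + 1 + self.depth)))
        return "miss"

sb = StreamBufferCache(num_buffers=2, depth=4)
results = [sb.access(b) for b in (100, 200, 101, 201, 102, 100)]
print(results)
```

In this run the two interleaved streams starting at blocks 100 and 200 each miss once; the following sequential accesses are served from the heads of their stream buffers, and the final access to block 100 hits in the cache.&lt;br /&gt;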
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Cache_hit_improvements.jpg|center]]&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
The plot above shows the cache hit improvement as a function of the number of stream buffers for different programs.  The graph on the left compares 8 programs from the NAS suite, while the graph on the right shows programs from the Unix unity suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching is definitely helpful for improving performance.  On a multiprocessor system prefetching is also useful, but its implementation faces tighter constraints because data is shared between different processors or cores.  In the message-passing parallel model, each parallel thread has its own memory space, and prefetching can be implemented in the same way as on a uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared-memory parallel programming, multiple threads that run on different processors share a common memory space.  If multiple processors use a common cache, the prefetching implementation is similar to that of a uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but has not used it yet.  Meanwhile, processor P2 reads D1 into its cache and modifies it, and a cache coherence protocol is used to inform P1 of the change.  D1 is not in P1’s cache, so P1 may simply ignore the notification.  Now, when P1 tries to read D1, it will get the stale data from its stream buffer.  One way to prevent this is to improve the stream buffers so that they can modify their data just as a cache does.  This adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
# The prefetching mechanism can fetch data D1, D2, D3, …, D10 into P1’s buffer.  Because of parallel processing, P1 only needs to operate on D1 to D5 while P2 operates on the remaining data.  Bandwidth was wasted fetching D6 to D10 into P1’s buffer, since P1 never used them.  There is a trade-off here: if prefetching is too conservative it leads to misses, and if it is too aggressive it wastes bandwidth.&lt;br /&gt;
# In a multiprocessor system, if threads are not bound to a core, the operating system can rebalance the threads across cores.  The prefetched buffers then have to be discarded.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
The following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
==Prefetching in Intel Core i7&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track recent accesses to memory:  one for upstreams, which has 12 entries, and one for downstreams, which has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
&lt;br /&gt;
==Prefetching in AMD&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to that of the Intel processor described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
&lt;br /&gt;
AMD contends that this is beneficial when processing large data sets which often process sequential data and process all the data in the stream.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing. &lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to ensure that the prefetcher fetches at a rate sufficient to continuously provide data to the processor.&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Support=&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache Coherence] (also cache coherency) refers to the consistency of data stored in local caches of a shared resource. In a [http://en.wikipedia.org/wiki/Shared_memory Shared memory] multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.&lt;br /&gt;
&lt;br /&gt;
Cache coherence is achieved if the following conditions are met.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and then reads from the same memory location X, then P1 should be returned value A provided no other write happened in-between.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and another processor P2 reads from the same memory location X, then P2 should be returned the value A written by P1 provided no other write happened in-between.  &lt;br /&gt;
* Consider that a processor P1 writes a value A to a memory location X followed by another processor P2 which writes a value B to the same memory location X. In this case the writes should appear in the same order i.e. A and then B.&lt;br /&gt;
&lt;br /&gt;
==Software vs. Hardware solutions==&lt;br /&gt;
&lt;br /&gt;
Both software and hardware solutions exist for cache coherence. Hardware solutions are more widely used than software solutions. Software schemes require support from the compiler or runtime system, and in some cases also require hardware assistance. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;ref&amp;gt;[ftp://ftp.cs.wisc.edu/markhill/Papers/isca91_coherence.pdf Comparison of Hardware and Software Cache Coherence Schemes] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon&amp;lt;/ref&amp;gt;Recent studies have shown that software schemes are comparable to hardware solutions. The only cases in which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts, or when there are many writes in the program that are not executed at run-time. For relatively well structured and deterministic programs, on the other hand, software schemes perform in the same range as hardware schemes.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Schemes – Fetch and Replacements==&lt;br /&gt;
&lt;br /&gt;
===Invalidation Schemes vs. Update Strategies&amp;lt;ref&amp;gt;[https://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf Cache Coherence] Josep Torrellas&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Cache_invalidation Invalidation]: On a write, all other caches with a copy are invalidated. Invalidation is least suitable when there is a single producer and many consumers of the data, because every consumer must re-fetch the block after each write.&lt;br /&gt;
* Update: On a write, all other caches with a copy are updated. The update strategy is least suitable when one processing element, say PE1, writes the data multiple times before another processing element, PE2, reads it, because every intermediate write is broadcast needlessly.&lt;br /&gt;
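&lt;br /&gt;
The two weak spots noted above can be made concrete by counting bus transactions for a single block. The model below is a deliberately simplified sketch: the function name, the one-transaction-per-event cost, and the two access patterns are assumptions chosen for illustration, not a description of any specific protocol.&lt;br /&gt;

```python
def bus_transactions(ops, policy):
    """Count bus transactions for one block under a toy 'invalidate' or 'update' policy."""
    holders = set()   # caches that currently hold a valid copy
    count = 0
    for kind, proc in ops:
        if kind == "R":
            if proc not in holders:
                count += 1            # read miss: fetch the block
                holders.add(proc)
        elif policy == "invalidate":
            if holders != {proc}:
                count += 1            # obtain the only valid copy, invalidating the rest
            holders = {proc}
        else:                         # "update"
            if proc not in holders:
                count += 1            # write miss: fetch the block
                holders.add(proc)
            if holders != {proc}:
                count += 1            # broadcast the new value to the other holders
    return count

# One producer (P0) and three consumers, repeated three times.
producer_consumer = [("W", "P0"), ("R", "P1"), ("R", "P2"), ("R", "P3")] * 3
# P0 writes five times between two reads by P1.
repeated_writes = [("R", "P1")] + [("W", "P0")] * 5 + [("R", "P1")]

for name, ops in (("producer/consumer", producer_consumer), ("repeated writes", repeated_writes)):
    print(name, bus_transactions(ops, "invalidate"), bus_transactions(ops, "update"))
```

Under these assumptions the producer/consumer pattern costs 12 transactions with invalidation and 6 with update, while the repeated-write pattern costs 3 with invalidation and 7 with update.&lt;br /&gt;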
&lt;br /&gt;
The strategies followed for fetch and replacement in some of the schemes are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Snoopy Cache Coherence Schemes===&lt;br /&gt;
#[http://en.wikipedia.org/wiki/Bus_sniffing Snooping]&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt; is the process in which the individual caches monitor address lines for accesses to memory locations that they have cached.  When a write is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.&lt;br /&gt;
#Snoopy Scheme is a distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.&lt;br /&gt;
#When replacement of one of the entries is required the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries.&lt;br /&gt;
&lt;br /&gt;
===Directory Based Cache Coherence Schemes===&lt;br /&gt;
#In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.&lt;br /&gt;
#Fetch and Replacement Scenarios&amp;lt;ref&amp;gt;[http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf Cache Coherence Techniques] Silvia Lametti&amp;lt;/ref&amp;gt;:&lt;br /&gt;
##Block in Uncached State: The block required by a processor P1 is not in any cache. When P1 tries to read or write an uncached block X, a miss occurs.&lt;br /&gt;
##*Read Miss: P1 is sent the data from memory and becomes the only sharing node. The cached block X is marked as shared.&lt;br /&gt;
##*Write Miss: P1 is sent the data and becomes the sharing node. The block is now marked exclusive to indicate that the only valid copy is cached.&lt;br /&gt;
##Block is Shared: The block requested by a processor P1 is in the shared state.&lt;br /&gt;
##*Read Miss:  The requesting processor P1 is sent the data and is added to the set of processors sharing the data.&lt;br /&gt;
##*Write Miss:  The requesting processor P1 is sent the value. All processors sharing the data are sent an invalidate message. The block is made exclusive to P1.&lt;br /&gt;
##Block is Exclusive: The current value of the block (say X) is held in the cache of the processor P1 (the owner), which currently owns the block exclusively.&lt;br /&gt;
##*Read Miss: When a processor P2 raises a fetch request for block X, P1 is notified that a fetch request has been posted for a block it currently holds. P1 sends the data back to the directory, where it is written to memory and sent on to the requesting processor P2. P2 is then added to the set of processors sharing block X, along with P1.&lt;br /&gt;
##*Data Write-back: The owner processor P1 is replacing the block and hence must write it back. This brings the copy of block X in memory up to date. The block is now uncached and the set of sharing processors is emptied.&lt;br /&gt;
##*Write Miss: A processor P2 makes a write request to block X, which is currently owned by P1. P1 is notified of this, which causes P1 to update memory with its value of block X. The updated block X is then sent to P2 and made exclusive to P2.&lt;br /&gt;
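&lt;br /&gt;
The directory transitions listed above can be sketched as a small state machine for a single block. The class and method names are hypothetical, and data movement (memory reads, write-backs, the actual messages) is reduced to comments; only the state and the sharer set are tracked.&lt;br /&gt;

```python
class DirectoryEntry:
    """Toy directory entry for one block: state plus the set of sharing processors."""

    def __init__(self):
        self.state = "uncached"
        self.sharers = set()

    def read_miss(self, proc):
        # If the block is exclusive, the owner first sends its data back to the
        # directory, which writes it to memory; the owner remains a sharer.
        self.sharers.add(proc)
        self.state = "shared"

    def write_miss(self, proc):
        # Every other holder (sharers or the previous owner) loses its copy.
        invalidated = self.sharers - {proc}
        self.sharers = {proc}
        self.state = "exclusive"
        return invalidated

    def write_back(self, proc):
        # Only the exclusive owner can write the block back when replacing it.
        if self.state == "exclusive" and self.sharers == {proc}:
            self.sharers = set()
            self.state = "uncached"

entry = DirectoryEntry()
entry.read_miss("P1")                        # uncached to shared, sharers {P1}
entry.read_miss("P2")                        # shared, sharers {P1, P2}
print(entry.write_miss("P1"), entry.state)   # P2 is invalidated, block exclusive to P1
entry.read_miss("P2")                        # owner supplies data, block shared again
entry.write_miss("P2")                       # P1 is invalidated, block exclusive to P2
entry.write_back("P2")                       # block returns to uncached
print(entry.state, entry.sharers)
```

Returning the invalidated set from write_miss is a design choice of this sketch; it stands in for the invalidate messages the directory would send.&lt;br /&gt;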
&lt;br /&gt;
===Write Through Schemes===&lt;br /&gt;
In write-through schemes, every processor write results in an update of the local cache and a global bus write that updates main memory and also invalidates or updates all other caches holding that same item.&lt;br /&gt;
&lt;br /&gt;
===Write-Back/Ownership Schemes===&lt;br /&gt;
In write-back/ownership schemes, a single cache has ownership of a block. In this case when a processor writes it will not result in bus writes thus conserving bandwidth. &lt;br /&gt;
&lt;br /&gt;
===Pointer-Based Coherence Schemes&amp;lt;ref&amp;gt;[http://courses.engr.illinois.edu/cs533/reading_list/1c.pdf Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry&amp;lt;/ref&amp;gt;===&lt;br /&gt;
#The Full Bit Vector Schemes&lt;br /&gt;
#*This scheme associates a complete bit vector, one bit per processor, with each block of main memory. The directory also contains a dirty-bit for each memory block to indicate if some processor has been given exclusive access to modify that block in its cache. Each bit indicates whether that memory block is being cached by the corresponding processor, and thus the directory has full knowledge of the processors caching a given block.&lt;br /&gt;
#*When a block has to be invalidated, messages are sent to all processors whose caches have a copy. In terms of message traffic needed to keep the caches coherent, this is the best that an invalidation-based directory scheme can do.&lt;br /&gt;
#Limited Pointer Schemes&lt;br /&gt;
#*Recent studies have shown that for most kinds of data objects the corresponding memory locations are cached by only a small number of processors at any given time. This knowledge can be exploited to reduce directory memory overhead by restricting each directory entry to a small fixed number of pointers, each pointing to a processor caching that memory block.&lt;br /&gt;
#*An important implication of limited pointer schemes is that there must exist some mechanism to handle blocks that are cached by more processors than the number of pointers in the directory entry.&lt;br /&gt;
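&lt;br /&gt;
The directory memory overhead of the two schemes can be compared with simple arithmetic. The sketch below assumes one dirty bit per entry in both schemes and pointers just wide enough to name any processor; the function names and the choice of four pointers are illustrative assumptions.&lt;br /&gt;

```python
def full_bit_vector_bits(num_procs):
    """Directory bits per memory block: one presence bit per processor plus a dirty bit."""
    return num_procs + 1

def limited_pointer_bits(num_procs, num_pointers):
    """Directory bits per memory block: fixed number of processor pointers plus a dirty bit."""
    pointer_width = (num_procs - 1).bit_length()   # bits needed to name one processor
    return num_pointers * pointer_width + 1

for procs in (16, 64, 256, 1024):
    print(procs, full_bit_vector_bits(procs), limited_pointer_bits(procs, num_pointers=4))
```

With four pointers the two schemes cost the same at 16 processors (17 bits per block), but at 1024 processors the full bit vector needs 1025 bits per block against 41 for the limited pointer entry.&lt;br /&gt;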
&lt;br /&gt;
==Cache Coherency protocol==&lt;br /&gt;
A '''coherency protocol''' is a protocol which maintains the consistency between all the caches in a system of [[distributed shared memory]]. The protocol maintains [[memory coherence]] according to a specific [[consistency model]]. Older multiprocessors support the [[sequential consistency]] model, while modern shared memory systems typically support the [[release consistency]] or [[weak consistency]] models.&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
&lt;br /&gt;
Various models and protocols have been devised for maintaining coherence, such as [[MSI_protocol|MSI]], [[MESI_protocol|MESI]] (aka Illinois), [[MOSI_protocol|MOSI]], [[MOESI_protocol|MOESI]], [[MERSI_protocol|MERSI]], [[MESIF_protocol|MESIF]], [[Write-once (cache coherence)|write-once]], [[Synapse protocol|Synapse]], [[Berkeley protocol|Berkeley]], [[Firefly protocol|Firefly]] and [[Dragon protocol]].&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology- Lectures slides by Prof.Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel computer architecture: a hardware/software approach” by David. E. Culler, Jaswinder Pal Singh, Anoop Gupta  (pg 887)    &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler.   (ref#2) &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Other References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73602</id>
		<title>CSC/ECE 506 Spring 2013/6a cs</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73602"/>
		<updated>2013-02-24T07:59:46Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Cache Coherence Support */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|300px|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, the processor reads data and instructions from memory and operates on the data.  CPU operating frequencies have increased faster than the speed of memory and memory interconnects.  For example, cores in Intel first-generation i7 processors run at 3.2 GHz, while the memory runs at only 1.3 GHz.  Multi-core architectures have also put more demand on memory bandwidth.  This increases memory access latency, leaving the CPU idle most of the time.  As a result, memory became a performance bottleneck.&lt;br /&gt;
&lt;br /&gt;
To solve this problem, the “cache” was invented.  A cache is simply temporary volatile storage, like primary memory, but it runs at a speed close to the core frequency.  The CPU can access data and instructions from cache in a few clock cycles, while accessing data from main memory can take more than 50 cycles.  In the early days of computing, cache was implemented as a stand-alone chip outside the processor.  In today’s processors, cache is implemented on the same die as the core.&lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of cache, each one farther from the core and larger than the last.  L1 is closest to the CPU and, as a result, fastest to access.  Next to L1 is the L2 cache, and then L3.  The L1 cache is divided into an instruction cache and a data cache. This is better than having a larger combined cache because the instruction cache, being read-only, is easy to implement, while the data cache is read-write.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds that are advancing at a more rapid pace.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16-64KB and provides access in 2-4 cycles.  L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles.  L3 typically ranges from 4MB to 32MB and provides access to the data it contains within 30-50 cycles.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management by requiring it to consider all processes running concurrently in its cache coherency, replenishment, fetching, and other management functions.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors and others private to each processor.  Typically, the caches closest to the core (such as L1) are private, and the caches farther from the core may or may not be shared.  Notice that none of the examples below has a shared L1 cache.  Cache management functions must consider both shared and private caches when reading and writing data from memory.&lt;br /&gt;
[[Image:Uniprocessor_memory_hierarchy.jpg|thumbnail|200px|Uniprocessor memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[Image:Dualcore_memory_hierarchy.jpg|thumbnail|200px|Dualcore memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;|Table 1: Cache on different Microprocessors&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache Per core !! L2 cache !! L3 cache  !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 1MB 16way set assoc. || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64X2 || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 512KB/1MB 16way set assoc. || 2MB || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Barcelona || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Aug 2007&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems Ultra Sparc T2 || 8 ||  I:16KB   D:8KB || 4MB (8 banks) 16way set assoc. || - || Oct 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Wolfdale DP || 2 ||  D:96KB  || 6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Harpertown || 4 ||  D:96KB  || 2*6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Phenom || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Nov 2007    Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo || 2 ||  I:32KB   D:32KB  || 2/4MB 8 way set assoc. || - ||  2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Penryn Wolfdale DP || 4 ||  -  || 6-12MB || 6MB || Mar 2008     Aug 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Quad Yorkfield || 4 ||  D:96KB  || 12MB  || - ||  Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| AMD Toliman || 3K10 || I:64KB   D:64KB  || 512KB || 2MB Shared || Apr 2008&lt;br /&gt;
|-&lt;br /&gt;
| Azul Systems Vega3 7300 Series || 864 || 768GB  || - || - || May 2008&lt;br /&gt;
|-&lt;br /&gt;
| IBM RoadRunner || 8+1 || 32KB  || 512KB || - || Jun 2008&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
In section 6.2.3&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory. As review, write-through writes data to the cache and memory on a write. Write-back writes to cache first and to memory only when a flush is required.&lt;br /&gt;
The write miss policies covered in the text&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
[[Image:Cache_policy_comparison.jpg|thumbnail|600px|Cache policy comparison&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Easy to implement&lt;br /&gt;
***Main memory has most recent copy of the data&lt;br /&gt;
***Read misses never result in writes to main memory&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Every write needs to access main memory&lt;br /&gt;
***Bandwidth intensive&lt;br /&gt;
***Writes are slower&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Writes are as fast as the speed of the cache memory&lt;br /&gt;
***Multiple writes to a block require one write to main memory&lt;br /&gt;
***Less bandwidth intensive&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Harder to implement&lt;br /&gt;
***Main memory may not be consistent with cache&lt;br /&gt;
***Reads that result in data replacement may cause dirty blocks to be written to main memory&lt;br /&gt;
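&lt;br /&gt;
The bandwidth difference between the two write hit policies can be illustrated by counting main-memory writes for a short sequence of stores. This is a sketch under stated assumptions: a tiny direct-mapped, write-allocate cache, a final flush of dirty lines for write-back, and a made-up access sequence; the function name is hypothetical.&lt;br /&gt;

```python
def memory_writes(write_blocks, policy, num_lines=2):
    """Count main-memory writes for a toy direct-mapped, write-allocate cache."""
    lines = {}        # line index mapped to (block, dirty)
    count = 0
    for block in write_blocks:
        index = block % num_lines
        resident = lines.get(index)
        if policy == "write-through":
            count += 1                      # every store also goes to memory
            lines[index] = (block, False)
        else:                               # "write-back"
            if resident is not None and resident[0] != block and resident[1]:
                count += 1                  # a dirty block is evicted and written out
            lines[index] = (block, True)
    if policy == "write-back":
        # Assumed final flush so both policies leave memory up to date.
        count += sum(1 for _, dirty in lines.values() if dirty)
    return count

stores = [0, 0, 0, 1, 1, 0, 2]
print(memory_writes(stores, "write-through"), memory_writes(stores, "write-back"))
```

For this sequence write-through performs 7 memory writes and write-back performs 3, because repeated stores to the same block are absorbed by the cache.&lt;br /&gt;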
&lt;br /&gt;
==Write miss policies==&lt;br /&gt;
Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus. &lt;br /&gt;
&lt;br /&gt;
*Write-allocate vs no-write-allocate: When a write misses in the cache, a cache line may or may not be allocated for the block.  With write-allocate, a line is allocated in the cache for the written data.  This policy is typically associated with write-back caches.  With no-write-allocate, no line is allocated in the cache.&lt;br /&gt;
*Fetch-on-write vs no-fetch-on-write: With fetch-on-write, the block of data is fetched from a lower level of the memory hierarchy on every write miss. Fetch-on-write determines whether the block is fetched from memory, and write-allocate determines whether the memory block is stored in a cache line. Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.&lt;br /&gt;
*Write-before-hit vs no-write-before-hit:  With write-before-hit, data is written to the cache before the cache tags are checked for a match.  In case of a miss, the policy will displace the block of data already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram. They are considered unproductive since the data is fetched from lower level memory, but is not stored in the cache. Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low.&lt;br /&gt;
&lt;br /&gt;
* Write-validate: It is a combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run.  Write-validate requires that the lower level system memory can support writes of partial lines. The no-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the process is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-around:  Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme to write to cache.  It writes data to the next lower cache without modifying the data of the cache line. It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that write-around performs better than write-validate in only a few cases. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from lower-level memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst, with the smallest reduction in write misses (35-50%). The author notes, though, that it does perform better than fetch-on-write and is easy to implement.&lt;br /&gt;
&lt;br /&gt;
* Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the base-line for determining the performance of the other three policy combinations.   In general, all three previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
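&lt;br /&gt;
The way the three write-miss choices combine into the named policies above can be summarized as a small lookup. This is only a sketch of the naming described in this section; the function name and the label "unproductive" are illustrative assumptions, not terminology taken from the cited paper.&lt;br /&gt;
&lt;br /&gt;
```python
def classify(fetch_on_write, write_allocate, write_before_hit=False):
    """Map a combination of write-miss choices to the policy name used above."""
    if fetch_on_write:
        if write_allocate:
            return "fetch-on-write"
        # Fetching a block without keeping it in the cache: the blocked-out case.
        return "unproductive"
    if write_allocate:
        return "write-validate"
    # No fetch and no allocation: the name depends on write-before-hit.
    return "write-invalidate" if write_before_hit else "write-around"


print(classify(True, True))            # fetch-on-write
print(classify(False, True))           # write-validate
print(classify(False, False, True))    # write-invalidate
print(classify(False, False, False))   # write-around
```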
&lt;br /&gt;
=Prefetching=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. A cache is very efficient in terms of access time once the data or instructions are in the cache.  But when a process tries to access something that is not already in the cache, a cache miss occurs and that data needs to be brought into the cache from memory.  Cache misses are generally expensive because the processor has to wait for the data (in parallel processing, a process can execute other tasks while it is waiting on data, but there will be some overhead for this).  Prefetching brings data into the cache before the program needs it; in other words, it is a way to reduce cache misses.  Prefetching uses some type of prediction mechanism to anticipate the next data that will be needed and brings it into the cache.  It is not guaranteed that the prefetched data will be used.  The goal is to reduce cache misses and thereby improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into cache.  Programmers and compilers can insert these prefetch instructions in the code.  This is known as software prefetching.  In hardware prefetching, the processor observes the system behavior and issues requests for prefetching.  Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch.   Graphics Processing Units benefit from prefetching due to the spatial locality of their data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.   &lt;br /&gt;
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires a more complex architecture.  A second-order effect is the cost of implementing it in silicon and of validating it.&lt;br /&gt;
* Software prefetching adds additional instructions to the program, making the program larger.&lt;br /&gt;
* If the same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted.  If the evicted blocks are needed later, that will generate a cache miss.  This can be prevented by having a separate cache for prefetching, but that comes with hardware costs.&lt;br /&gt;
* When the scheduler changes the task running on a processor, prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked by the following metrics:&lt;br /&gt;
# Coverage is the fraction of original cache misses that become cache hits because the data was prefetched.&lt;br /&gt;
# Accuracy is the fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetches arrive.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy and optimum timeliness.  Realistically, aggressive prefetching can increase coverage but decrease accuracy, and vice versa.  Also, if prefetching is done too early, the fetched data may have to be evicted without being used, and if it is done too late, the access still results in a cache miss.&lt;br /&gt;
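&lt;br /&gt;
Coverage and accuracy as defined above can be computed from three counters. The following is a minimal sketch; the function name and the counter values for the two prefetchers are made-up numbers used only to show the trade-off, not measurements.&lt;br /&gt;
&lt;br /&gt;
```python
def prefetch_metrics(baseline_misses, prefetches_issued, useful_prefetches):
    """Return (coverage, accuracy) for a prefetcher.

    baseline_misses:   misses the program incurs without prefetching
    prefetches_issued: prefetch requests sent to memory
    useful_prefetches: prefetched blocks that were actually referenced
    """
    coverage = useful_prefetches / baseline_misses
    accuracy = useful_prefetches / prefetches_issued
    return coverage, accuracy


# Hypothetical counters: a conservative and an aggressive prefetcher.
print(prefetch_metrics(1000, 300, 270))    # (0.27, 0.9)
print(prefetch_metrics(1000, 2000, 800))   # (0.8, 0.4)
```
&lt;br /&gt;
In this made-up example the aggressive prefetcher removes more of the original misses but wastes more than half of the bandwidth it uses.&lt;br /&gt;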
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This is a technique for prefetching which uses a FIFO buffer in which each entry is a cache line and has an address (or tag) and an available bit.  The system prefetches a stream of sequential data into a stream buffer, and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, the head entries of the stream buffers are checked for a match along with the cache. If the block is not found in the cache but is found at the head of a stream buffer, it is moved to the cache and the next entry in the buffer becomes the head.  If the block is found neither in the cache nor at the head of a buffer, the data is brought from memory into the cache and the subsequent address is assigned to a new stream buffer.   Only the heads of the stream buffers are checked during a cache access, not the whole buffer, because checking all the entries in all the buffers would increase hardware complexity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Cache_hit_improvements.jpg|center]]&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
The plot above shows the cache hit improvements with respect to the number of stream buffers on different programs.  The graph on the left compares 8 programs from the NAS suite while the graph on the right shows programs from the Unix unity suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
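&lt;br /&gt;
The lookup sequence described in this section (cache first, then only the heads of the stream buffers, then memory) can be sketched as a toy simulator. The class name, the unbounded cache modeled as a set, the buffer depth of 4, and the choice to replace the oldest buffer are all simplifying assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque


class StreamBufferCache:
    """Toy model: an unbounded cache plus a few FIFO stream buffers."""

    def __init__(self, num_buffers=2, depth=4):
        self.cache = set()
        # When all buffers are in use, the oldest one is discarded (assumption).
        self.buffers = deque(maxlen=num_buffers)
        self.depth = depth
        self.hits = self.buffer_hits = self.misses = 0

    def access(self, block):
        if block in self.cache:
            self.hits += 1
            return "cache hit"
        for buf in self.buffers:
            if buf and buf[0] == block:      # only the head entry is compared
                buf.popleft()                # the next entry becomes the head
                tail = buf[-1] if buf else block
                buf.append(tail + 1)         # keep prefetching the stream
                self.cache.add(block)
                self.buffer_hits += 1
                return "stream buffer hit"
        # Miss everywhere: fetch the block and start a new sequential stream.
        self.misses += 1
        self.cache.add(block)
        self.buffers.append(deque(range(block + 1, block + 1 + self.depth)))
        return "miss"


sb = StreamBufferCache()
for block in range(8):           # a purely sequential access pattern
    sb.access(block)
print(sb.misses, sb.buffer_hits)  # 1 7
```
&lt;br /&gt;
For a sequential stream only the first access misses; every later block is found at the head of the stream buffer.&lt;br /&gt;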
&lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching is definitely helpful to improve performance.  On a multiprocessor system prefetching is useful, but there are tighter constraints on its implementation because data will be shared between different processors or cores.  In the message passing parallel model, each parallel thread has its own memory space and prefetching can be implemented in the same way as for a uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared memory parallel programming, multiple threads that run on different processors share common memory space.  If multiple processors use common cache, then prefetching implementation is similar to uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the case-scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but has not used it yet.  At the same time processor P2 reads data D1 into its cache and modifies it, and one of the cache coherency protocols is used to inform processor P1 about this change.  D1 is not in P1’s cache, so P1 may simply ignore this.  Now, when P1 tries to read D1, it will get the stale data from its stream buffer.  One way to prevent this is by improving stream buffers so that they can modify their data just like a cache.  This adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
# The prefetching mechanism can fetch data D1, D2, D3, … D10 into P1’s buffer.  Now, due to parallel processing, P1 only needs to operate on D1 to D5 while P2 operates on the remaining data.  Some bandwidth was wasted in fetching D6 to D10 into P1 even though P1 did not use them.  There is a trade-off to be made here: if prefetching is very conservative it will lead to misses, and if it is aggressive it will waste bandwidth.&lt;br /&gt;
# In a multiprocessor system, if threads are not bound to a core, the operating system can rebalance the threads across different cores.  This will require the prefetched buffers to be trashed.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
==Prefetching in Intel Core i7&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory:  one for the upstreams that has 12 entries, and one for downstreams that has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
&lt;br /&gt;
==Prefetching in AMD&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
&lt;br /&gt;
AMD contends that this is beneficial when processing large data sets which often process sequential data and process all the data in the stream.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing. &lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to ensure that the prefetcher is fetching at a rate sufficient to continuously provide data to the processor.&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Support=&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache Coherence] (also cache coherency) refers to the consistency of data stored in local caches of a shared resource. In a [http://en.wikipedia.org/wiki/Shared_memory Shared memory] multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.&lt;br /&gt;
&lt;br /&gt;
Cache coherence is achieved if the following conditions are met.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and  reads from the same memory location X, then P1 should be returned value A provided no other write happened in-between.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and another processor P2 reads from the same memory location X, then P2 should be returned the value A written by P1 provided no other write happened in-between.  &lt;br /&gt;
* Consider that a processor P1 writes a value A to a memory location X followed by another processor P2 which writes a value B to the same memory location X. In this case the writes should appear in the same order i.e. A and then B.&lt;br /&gt;
&lt;br /&gt;
==Software vs. Hardware solutions==&lt;br /&gt;
&lt;br /&gt;
Both software and hardware solutions exist for cache coherence. Hardware solutions are more widely used than software solutions. Software schemes require support from the compiler or runtime system, and in some cases also require hardware assistance. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;ref&amp;gt;[ftp://ftp.cs.wisc.edu/markhill/Papers/isca91_coherence.pdf Comparison of Hardware and Software Cache Coherence Schemes] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon&amp;lt;/ref&amp;gt;Recent studies have shown that software schemes are comparable to the hardware solutions. The only cases for which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts or when there are many writes in the program that are not executed at run-time. For relatively well structured and deterministic programs, on the other hand, software schemes perform in the same range as the hardware schemes.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Schemes – Fetch and Replacements==&lt;br /&gt;
&lt;br /&gt;
===Invalidation Schemes vs. Update Strategies&amp;lt;ref&amp;gt;[https://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf Cache Coherence] Josep Torrellas&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Cache_invalidation Invalidation]: On a write, all other caches with a copy are invalidated. Invalidation is least recommended when there is a single producer and many consumers of data.&lt;br /&gt;
* Update: On a write, all other caches with a copy are updated. Update Strategy is least recommended when there are multiple writes by one programming element say PE1 before data is read by another programming element PE2.&lt;br /&gt;
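&lt;br /&gt;
The two recommendations above can be checked with a deliberately crude traffic model. It assumes that one broadcast on the bus counts as one transaction, that every consumer holds a copy before the producer starts writing, and that every consumer reads once after the writes; the function and parameter names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def bus_transactions(protocol, writes_before_read, consumers):
    """Bus transactions for one round: the producer writes, then consumers read."""
    if protocol == "update":
        # Every write broadcasts the new value; the later reads hit locally.
        return writes_before_read
    # Invalidate: the first write invalidates the other copies, later writes
    # stay local, and every consumer then takes a read miss.
    return 1 + consumers


# One producer element writing 10 times before a single consumer reads.
print(bus_transactions("update", 10, 1), bus_transactions("invalidate", 10, 1))   # 10 2
# One write followed by reads from 8 consumers.
print(bus_transactions("update", 1, 8), bus_transactions("invalidate", 1, 8))     # 1 9
```
&lt;br /&gt;
Under these assumptions, update generates more traffic when there are many writes between reads, and invalidation generates more traffic when a single producer feeds many consumers.&lt;br /&gt;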
&lt;br /&gt;
The strategies followed for fetch and replacement in some of the schemes are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Snoopy Cache Coherence Schemes===&lt;br /&gt;
#Snooping&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt; is the process where the individual caches monitor address lines for accesses to memory locations that they have cached.  When a write is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.&lt;br /&gt;
#Snoopy Scheme is a distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.&lt;br /&gt;
#When replacement of one of the entries is required the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries.&lt;br /&gt;
&lt;br /&gt;
===Directory Based Cache Coherence Schemes===&lt;br /&gt;
#In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.&lt;br /&gt;
#Fetch and Replacement Scenarios&amp;lt;ref&amp;gt;[http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf Cache Coherence Techniques] Silvia Lametti&amp;lt;/ref&amp;gt;:&lt;br /&gt;
##Block in Uncached State: The block required by a processor P1 is not in the cache. When a processor P1 tries to read an uncached block X, read miss occurs. &lt;br /&gt;
##*Read Miss: P1 is sent the data from memory and becomes the only sharing node. The cached block X is marked as shared.&lt;br /&gt;
##*Write Miss: P1 is sent the data and becomes the only sharing node. The block is now marked exclusive to indicate that the only valid copy is cached.&lt;br /&gt;
##Block is Shared: The block requested by a Processor P1 is in shared state.&lt;br /&gt;
##*Read Miss:  Requesting Processor P1 is sent the data and P1 is added to the set of processors sharing the data.&lt;br /&gt;
##*Write Miss:  Requesting Processor P1 is sent the value. All processors sharing the data are sent an invalidate message. The block is made exclusive for the Processor P1.&lt;br /&gt;
##Block is Exclusive: The current value of the block (say X) is held in the cache of the processor (the owner) P1 which currently owns the block exclusively.&lt;br /&gt;
##*Read Miss: When a processor P2 raises a fetch request for the block X, P1 is notified that a fetch request has been posted for a block it currently holds. P1 sends the data back to the directory, where it is written to memory and sent on to the requesting processor P2. P2 is now added to the set of processors sharing the block X along with P1.&lt;br /&gt;
##*Data Write-back: The owner processor P1 is replacing the block and hence must write it back. This causes the copy of block X in memory to be up to date. The block will now be uncached and the set maintaining the list of sharing processors will be emptied.&lt;br /&gt;
##*Write Miss: A processor P2 makes write request to the block X which is currently owned by P1. P1 will be notified of this, which causes P1 to update the memory with its value of block X. The updated block X is now sent to P2 and it is made exclusive for P2.&lt;br /&gt;
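&lt;br /&gt;
The fetch and replacement scenarios above can be sketched as a small state machine for a single directory entry. The class, method names, and message strings are hypothetical, and the sketch tracks only the directory state and sharer set, not the data itself.&lt;br /&gt;
&lt;br /&gt;
```python
class DirectoryEntry:
    """Directory state for one memory block: uncached, shared, or exclusive."""

    def __init__(self):
        self.state = "uncached"
        self.sharers = set()

    def read_miss(self, proc):
        messages = []
        if self.state == "exclusive":
            owner = next(iter(self.sharers))
            # The owner returns the block; memory is updated before forwarding.
            messages.append(f"fetch from P{owner}")
        self.sharers.add(proc)
        self.state = "shared"
        messages.append(f"data to P{proc}")
        return messages

    def write_miss(self, proc):
        messages = []
        if self.state == "exclusive":
            owner = next(iter(self.sharers))
            messages.append(f"fetch from P{owner}")
        elif self.state == "shared":
            others = sorted(self.sharers - {proc})
            messages += [f"invalidate P{p}" for p in others]
        self.sharers = {proc}
        self.state = "exclusive"
        messages.append(f"data to P{proc}")
        return messages

    def write_back(self, proc):
        # The owner replaces the block: memory becomes current, nobody caches it.
        self.sharers.clear()
        self.state = "uncached"
        return [f"memory updated by P{proc}"]


entry = DirectoryEntry()
print(entry.read_miss(1), entry.state)    # ['data to P1'] shared
print(entry.read_miss(2), entry.state)    # ['data to P2'] shared
print(entry.write_miss(3), entry.state)   # ['invalidate P1', 'invalidate P2', 'data to P3'] exclusive
```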
&lt;br /&gt;
===Write Through Schemes===&lt;br /&gt;
In write-through schemes, every processor write results in an update of the local cache and a global bus write that updates main memory and also invalidates or updates all other caches holding the same item.&lt;br /&gt;
&lt;br /&gt;
===Write-Back/Ownership Schemes===&lt;br /&gt;
In write-back/ownership schemes, a single cache has ownership of a block. In this case when a processor writes it will not result in bus writes thus conserving bandwidth. &lt;br /&gt;
&lt;br /&gt;
===Pointer-Based Coherence Schemes&amp;lt;ref&amp;gt;[http://courses.engr.illinois.edu/cs533/reading_list/1c.pdf Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry&amp;lt;/ref&amp;gt;===&lt;br /&gt;
#The Full Bit Vector Schemes&lt;br /&gt;
#*This scheme associates a complete bit vector, one bit per processor, with each block of main memory. The directory also contains a dirty-bit for each memory block to indicate if some processor has been given exclusive access to modify that block in its cache. Each bit indicates whether that memory block is being cached by the corresponding processor, and thus the directory has full knowledge of the processors caching a given block.&lt;br /&gt;
#*When a block has to be invalidated, messages are sent to all processors whose caches have a copy. In terms of message traffic needed to keep the caches coherent, this is the best that an invalidation-based directory scheme can do.&lt;br /&gt;
#Limited Pointer Schemes&lt;br /&gt;
#*Recent studies have shown that for most kinds of data objects the corresponding memory locations are cached by only a small number of processors at any given time. This knowledge can be exploited to reduce directory memory overhead by restricting each directory entry to a small fixed number of pointers, each pointing to a processor caching that memory block.&lt;br /&gt;
#*An important implication of limited pointer schemes is that there must exist some mechanism to handle blocks that are cached by more processors than the number of pointers in the directory entry.&lt;br /&gt;
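&lt;br /&gt;
The directory memory saving can be estimated with a short calculation. This sketch assumes one dirty bit per entry in both schemes and pointers that are just wide enough to name any processor; the function names and the 256-processor, 4-pointer configuration are illustrative assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
def full_bit_vector_bits(processors):
    """Directory bits per memory block: one presence bit per processor plus a dirty bit."""
    return processors + 1


def limited_pointer_bits(processors, pointers):
    """Directory bits per memory block with a fixed number of processor pointers."""
    pointer_width = (processors - 1).bit_length()   # ceil(log2(processors))
    return pointers * pointer_width + 1             # plus the dirty bit


print(full_bit_vector_bits(256))       # 257
print(limited_pointer_bits(256, 4))    # 33
```
&lt;br /&gt;
With these assumed parameters a limited pointer entry is roughly one eighth the size of a full bit vector entry, which is the saving that motivates handling the overflow case separately.&lt;br /&gt;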
&lt;br /&gt;
==Cache Coherency protocol==&lt;br /&gt;
A '''coherency protocol''' is a protocol which maintains the consistency between all the caches in a system of [[distributed shared memory]]. The protocol maintains [[memory coherence]] according to a specific [[consistency model]]. Older multiprocessors support the [[sequential consistency]] model, while modern shared memory systems typically support the [[release consistency]] or [[weak consistency]] models.&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
&lt;br /&gt;
Various models and protocols have been devised for maintaining coherence, such as [[MSI_protocol|MSI]], [[MESI_protocol|MESI]] (aka Illinois), [[MOSI_protocol|MOSI]], [[MOESI_protocol|MOESI]], [[MERSI_protocol|MERSI]], [[MESIF_protocol|MESIF]], [[Write-once (cache coherence)|write-once]], [[Synapse protocol|Synapse]], [[Berkeley protocol|Berkeley]], [[Firefly protocol|Firefly]], and [[Dragon protocol]].&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology- Lectures slides by Prof.Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel computer architecture: a hardware/software approach” by David. E. Culler, Jaswinder Pal Singh, Anoop Gupta  (pg 887)    &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler.   (ref#2) &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Other References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73601</id>
		<title>CSC/ECE 506 Spring 2013/6a cs</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73601"/>
		<updated>2013-02-24T07:58:17Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Cache Coherence Support */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|300px|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, the processor reads data and instructions from memory and operates on the data.  The operating frequency of CPUs has increased faster than the speed of memory and memory interconnects.  For example, cores in Intel first generation i7 processors run at 3.2 GHz, while the memory only runs at 1.3 GHz.  Multi-core architectures also put more demand on memory bandwidth.  This increases the latency of memory accesses, and the CPU has to sit idle for much of the time.  As a result, memory became a performance bottleneck.   &lt;br /&gt;
&lt;br /&gt;
To solve this problem, “cache” was invented.  Cache is simply a temporary volatile storage space like primary memory, but it runs at a speed similar to the core frequency.  The CPU can access data and instructions from cache in a few clock cycles, while accessing data from main memory can take more than 50 cycles.  In the early days of computing, cache was implemented as a stand-alone chip outside the processor.  In today’s processors, cache is implemented on the same die as the core.  &lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of cache, each level farther from the core and larger in size than the previous one.  L1 is closest to the CPU and, as a result, fastest to access.  Next to L1 is the L2 cache and then L3.  The L1 cache is divided into an instruction cache and a data cache. This is better than having a combined larger cache because the instruction cache, being read-only, is easy to implement, while the data cache is read-write.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds that are advancing at a more rapid pace.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16-64KB and provides access in 2-4 cycles.  L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles.  L3 typically ranges from 4MB to 32MB and provides access to the data it contains within 30-50 cycles.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
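&lt;br /&gt;
The effect of these latencies can be illustrated by computing an average memory access time for a three-level hierarchy. The hit latencies below are picked from within the ranges quoted above; the miss rates and the 200-cycle main memory latency are assumptions made purely for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def average_access_time(levels, memory_latency):
    """Average access time in cycles.

    levels: list of (hit_latency_cycles, local_miss_rate), ordered from L1 outward.
    """
    total = memory_latency
    # Work inward from the last cache level: each level adds its own latency
    # and pays the cost of the next level only on a miss.
    for latency, miss_rate in reversed(levels):
        total = latency + miss_rate * total
    return total


# Assumed miss rates: 10% in L1, 50% of those in L2, 50% of those in L3.
hierarchy = [(4, 0.1), (10, 0.5), (40, 0.5)]
print(average_access_time(hierarchy, memory_latency=200))
```
&lt;br /&gt;
With these assumed numbers the average access costs about 12 cycles, compared with 200 cycles if every access went to main memory.&lt;br /&gt;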
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management by requiring it to consider all processes running concurrently in the cache coherency, replenishment, fetching, and other management functions.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors while others are private to each processor.  Typically, the caches closest to the processor (such as L1) are private, and the caches farther away may or may not be shared.   Notice that none of the examples below has a shared L1 cache.  Cache management functions must consider both shared and private caches when reading and writing data from memory.&lt;br /&gt;
[[Image:Uniprocessor_memory_hierarchy.jpg|thumbnail|200px|Uniprocessor memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[Image:Dualcore_memory_hierarchy.jpg|thumbnail|200px|Dualcore memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;|Table 1: Cache on different Microprocessors&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache per core !! L2 cache !! L3 cache  !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 1MB 16way set assoc. || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64X2 || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 512KB/1MB 16way set assoc. || 2MB || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Barcelona || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Aug 2007&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems Ultra Sparc T2 || 8 ||  I:16KB   D:8KB || 4MB (8 banks) 16way set assoc. || - || Oct 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Wolfdale DP || 2 ||  D:96KB  || 6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Harpertown || 4 ||  D:96KB  || 2*6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Phenom || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Nov 2007    Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo || 2 ||  I:32KB   D:32KB  || 2/4MB 8 way set assoc. || - ||  2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Penryn Wolfdale DP || 4 ||  -  || 6-12MB || 6MB || Mar 2008     Aug 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Quad Yorkfield || 4 ||  D:96KB  || 12MB  || - ||  Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| AMD Toliman || 3K10 || I:64KB   D:64KB  || 512KB || 2MB Shared || Apr 2008&lt;br /&gt;
|-&lt;br /&gt;
| Azul Systems Vega3 7300 Series || 864 || 768GB  || - || - || May 2008&lt;br /&gt;
|-&lt;br /&gt;
| IBM RoadRunner || 8+1 || 32KB  || 512KB || - || Jun 2008&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
In section 6.2.3&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory. As a review, write-through writes data to both the cache and memory on every write. Write-back writes to the cache first and to memory only when a flush is required.&lt;br /&gt;
The write miss policies covered in the text&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
[[Image:Cache_policy_comparison.jpg|thumbnail|600px|Cache policy comparison&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Easy to implement&lt;br /&gt;
***Main memory has most recent copy of the data&lt;br /&gt;
***Read misses never result in writes to main memory&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Every write needs to access main memory&lt;br /&gt;
***Bandwidth intensive&lt;br /&gt;
***Writes are slower&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Writes are as fast as the speed of the cache memory&lt;br /&gt;
***Multiple writes to a block require one write to main memory&lt;br /&gt;
***Less bandwidth intensive&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Harder to implement&lt;br /&gt;
***Main memory may not be consistent with cache&lt;br /&gt;
***Reads that result in data replacement may cause dirty blocks to be written to main memory&lt;br /&gt;
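&lt;br /&gt;
The bandwidth difference between the two write hit policies can be sketched with a small model. The direct-mapped cache of four lines, the use of write-allocate on every miss, and the function name are assumptions made for illustration only, not part of the referenced material.&lt;br /&gt;
&lt;br /&gt;
```python
def count_memory_writes(written_blocks, policy, num_lines=4):
    # Hypothetical model: direct-mapped cache, one block per line, and
    # write-allocate on every miss, so only the write-hit policy differs.
    if policy not in ('write-through', 'write-back'):
        raise ValueError('unknown policy: ' + policy)
    lines = {}              # line index -> (block address, dirty flag)
    memory_writes = 0
    for block in written_blocks:
        index = block % num_lines
        resident = lines.get(index)
        if resident is not None and resident[0] != block and resident[1]:
            memory_writes += 1          # evicted dirty block is written back
        if policy == 'write-through':
            memory_writes += 1          # every write also goes to memory
            lines[index] = (block, False)
        else:
            lines[index] = (block, True)
    # Flush the remaining dirty lines so both policies leave memory up to date.
    memory_writes += sum(1 for _, dirty in lines.values() if dirty)
    return memory_writes

writes = [0, 0, 0, 1, 1, 4]
through = count_memory_writes(writes, 'write-through')
back = count_memory_writes(writes, 'write-back')
print(through, back)
```
&lt;br /&gt;
For this short trace, write-through issues one memory write per store, while write-back coalesces repeated stores to the same block and only writes on eviction or flush.&lt;br /&gt;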
&lt;br /&gt;
==Write miss policies==&lt;br /&gt;
Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus. &lt;br /&gt;
&lt;br /&gt;
*Write-allocate vs Write no-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block.  For write-allocate, there will be a line in the cache for the written data.  This policy is typically associated with write-back caches.  For no-write-allocate, there will not be a line in the cache.&lt;br /&gt;
*Fetch-on-write vs no-fetch-on-write: Fetch-on-write causes the block of data to be fetched from a lower level of the memory hierarchy on every write miss. Fetch-on-write determines whether the block is fetched from memory, while write-allocate determines whether the memory block is stored in a cache line. Certain write policies may allocate a line in the cache for data that is written by the CPU without retrieving it from memory first.&lt;br /&gt;
*Write-before-hit vs no-write-before-hit:  The write-before-hit will write data to the cache before checking the cache tags for a match.  In case of a miss, the policy will displace the block of data already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram. They are considered unproductive since the data is fetched from lower-level memory but is not stored in the cache. Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low.&lt;br /&gt;
&lt;br /&gt;
* Write-validate: It is a combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run.  Write-validate requires that the lower level system memory can support writes of partial lines. The no-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the process is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-around:  Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme to write to cache.  It writes data to the next lower cache without modifying the data of the cache line. It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that in all but a few cases write-around performs worse than write-validate. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from lower-level memory. With write-validate, the data would be in the cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst, as the reductions quoted above show. The author notes, though, that it does perform better than fetch-on-write and is easy to implement.&lt;br /&gt;
&lt;br /&gt;
* Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the base-line for determining the performance of the other three policy combinations.   In general, all three previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
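&lt;br /&gt;
The naming of the combinations described in this section can be summarized as a small lookup. This is only a sketch of the classification given in the text; the function name and the label 'unproductive' for the blocked-out combinations are illustrative choices.&lt;br /&gt;
&lt;br /&gt;
```python
def name_write_miss_policy(fetch_on_write, write_allocate, write_before_hit):
    # Map the three write-miss choices to the combination names used above.
    if fetch_on_write:
        # Fetching a block without allocating a line for it is the
        # blocked-out (unproductive) case, whatever write_before_hit is.
        return 'fetch-on-write' if write_allocate else 'unproductive'
    if write_allocate:
        return 'write-validate'
    return 'write-invalidate' if write_before_hit else 'write-around'

for flags in [(True, True, False), (False, True, False),
              (False, False, True), (False, False, False)]:
    print(flags, name_write_miss_policy(*flags))
```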
&lt;br /&gt;
=Prefetching=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. A cache is very efficient in terms of access time once the data or instructions are in the cache.  But when a process tries to access something that is not already in the cache, a cache miss occurs and the missing blocks need to be brought into the cache from memory.  Cache misses are generally expensive because the processor has to wait for the data (in parallel processing, a processor can execute other tasks while it is waiting on data, but there is some overhead for this).  Prefetching is a technique in which data is brought into the cache before the program needs it.  In other words, it is a way to reduce cache misses.  Prefetching uses some type of prediction mechanism to anticipate the next data that will be needed and brings it into the cache.  It is not guaranteed that the prefetched data will be used.  The goal is to reduce cache misses and thereby improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into the cache.  Programmers and compilers can insert these prefetch instructions into the code.  This is known as software prefetching.  In hardware prefetching, the processor observes the system behavior and issues prefetch requests itself.  The Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch.   Graphics Processing Units benefit from prefetching due to the spatial locality of their data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.   &lt;br /&gt;
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires a more complex architecture.  A second-order effect is the cost of implementing it in silicon and validating it.&lt;br /&gt;
* Software prefetching adds instructions to the program, making the program larger.&lt;br /&gt;
* If the same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted.  If the evicted blocks are needed later, they will generate cache misses.  This can be prevented by having a separate cache for prefetching, but that comes with hardware costs.&lt;br /&gt;
* When the scheduler changes the task running on a processor, prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked with the following metrics:&lt;br /&gt;
# Coverage is defined as the fraction of original cache misses that were prefetched, resulting in a cache hit.&lt;br /&gt;
# Accuracy is defined as the fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetches arrive.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy and optimum timeliness.  Realistically, aggressive prefetching can increase coverage but decrease accuracy, and vice versa.  Also, if prefetching is done too early, the fetched data may have to be evicted without being used, and if it is done too late, it can still cause a cache miss.&lt;br /&gt;
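&lt;br /&gt;
Coverage and accuracy as defined above can be computed from three counters. The sketch below assumes that every used prefetch removes exactly one of the original misses; the function name and the sample counts are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def prefetch_metrics(baseline_misses, prefetches_issued, prefetches_used):
    # coverage: share of the original misses turned into hits by a prefetch
    coverage = prefetches_used / baseline_misses if baseline_misses else 0.0
    # accuracy: share of issued prefetches that were actually used
    accuracy = prefetches_used / prefetches_issued if prefetches_issued else 0.0
    return coverage, accuracy

# Made-up counts: 1000 misses without prefetching, 800 prefetches, 600 used.
coverage, accuracy = prefetch_metrics(1000, 800, 600)
print(coverage, accuracy)
```
&lt;br /&gt;
Issuing more prefetches can only raise coverage if more of them are used; otherwise only the denominator of accuracy grows, which is the trade-off described above.&lt;br /&gt;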
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This prefetching technique uses a FIFO buffer in which each entry holds a cache line together with its address (or tag) and an available bit.  The system prefetches a stream of sequential data into a stream buffer, and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, the head entries of the stream buffers are checked for a match along with the cache lookup.  If the block is not found in the cache but is found at the head of a stream buffer, it is moved to the cache and the next entry in the buffer becomes the head.  If the block is found neither in the cache nor at the head of a buffer, the data is brought from memory into the cache and the subsequent address is assigned to a new stream buffer.   Only the heads of the stream buffers are checked during a cache access, not the whole buffer, because checking every entry in every buffer would increase hardware complexity.&lt;br /&gt;
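&lt;br /&gt;
The lookup sequence described above can be sketched as a toy model. The unbounded cache, the buffer count and depth, the replacement of the oldest buffer when a new stream starts, and all names are simplifying assumptions for illustration; real stream buffers also track an available bit per entry.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque

class StreamBufferCache:
    # Toy model: an unbounded cache plus a few FIFO stream buffers.
    def __init__(self, num_buffers=2, depth=4):
        self.cache = set()
        self.depth = depth
        # Each buffer is a FIFO of prefetched block addresses, head at index 0.
        # With maxlen set, starting a new stream discards the oldest buffer.
        self.buffers = deque(maxlen=num_buffers)

    def access(self, block):
        if block in self.cache:
            return 'cache hit'
        for buf in self.buffers:
            if buf and buf[0] == block:       # only the head is compared
                buf.popleft()                 # head moves into the cache
                # Prefetch one more sequential block to keep the stream ahead.
                buf.append(buf[-1] + 1 if buf else block + 1)
                self.cache.add(block)
                return 'stream buffer hit'
        # Miss everywhere: fetch the block and start a new stream after it.
        self.cache.add(block)
        self.buffers.append(deque(range(block + 1, block + 1 + self.depth)))
        return 'miss'

sb = StreamBufferCache()
results = [sb.access(b) for b in [10, 11, 12, 10, 50, 13]]
print(results)
```
&lt;br /&gt;
In this trace the sequential accesses to blocks 11, 12 and 13 are served from the head of a stream buffer, while the unrelated access to block 50 misses and starts a second stream.&lt;br /&gt;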
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Cache_hit_improvements.jpg|center]]&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
The plot above shows the cache hit improvements with respect to the number of stream buffers for different programs.  The graph on the left compares 8 programs from the NAS suite, while the graph on the right shows programs from the Unix unity suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching is definitely helpful for improving performance.  On a multiprocessor system prefetching is useful, but there are tighter constraints on its implementation because data is shared between different processors or cores.  In the message-passing parallel model, each parallel thread has its own memory space, and prefetching can be implemented in the same way as for a uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared memory parallel programming, multiple threads that run on different processors share a common memory space.  If multiple processors use a common cache, then the prefetching implementation is similar to that of a uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but has not used it yet.  At the same time, processor P2 reads D1 into its cache and modifies it, and a cache coherency protocol informs processor P1 of the change.  D1 is not in P1’s cache, so P1 may simply ignore the notification.  Now, when P1 tries to read D1, it will get stale data from its stream buffer.  One way to prevent this is to enhance the stream buffers so that they can update their data just like a cache.  This adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
# The prefetching mechanism can fetch data D1, D2, D3, …, D10 into P1’s buffer.  Because of parallel processing, P1 only needs to operate on D1 to D5 while P2 operates on the remaining data.  Bandwidth was wasted fetching D6 to D10 into P1’s buffer, since P1 never used them.  There is a trade-off here: if prefetching is very conservative it will lead to misses, and if it is too aggressive it will waste bandwidth.&lt;br /&gt;
# In a multiprocessor system, if threads are not bound to a core, the operating system can rebalance the threads across different cores.  This requires the prefetched buffers to be discarded.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
==Prefetching in Intel Core i7&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track recent accesses to memory:  one for upstreams that has 12 entries, and one for downstreams that has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
&lt;br /&gt;
==Prefetching in AMD&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
&lt;br /&gt;
AMD contends that this is beneficial when processing large data sets which often process sequential data and process all the data in the stream.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing. &lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to assure that the prefetcher is fetching at a rate sufficient enough to continuously provide data to the processor.&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Support=&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache Coherence] (also cache coherency) refers to the consistency of data stored in local caches of a shared resource. In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.&lt;br /&gt;
&lt;br /&gt;
Cache coherence is achieved if the following conditions are met.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and  reads from the same memory location X, then P1 should be returned value A provided no other write happened in-between.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and another processor P2 reads from the same memory location X, then P2 should be returned the value A written by P1 provided no other write happened in-between.  &lt;br /&gt;
* Consider that a processor P1 writes a value A to a memory location X followed by another processor P2 which writes a value B to the same memory location X. In this case the writes should appear in the same order i.e. A and then B.&lt;br /&gt;
&lt;br /&gt;
==Software vs. Hardware solutions==&lt;br /&gt;
&lt;br /&gt;
Both software and hardware solutions exist for cache coherence. Hardware solutions are more widely used than software solutions. Software schemes require support from the cache or runtime system, and in some cases also require hardware assistance. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;ref&amp;gt;[ftp://ftp.cs.wisc.edu/markhill/Papers/isca91_coherence.pdf Comparison of Hardware and Software Cache Coherence Schemes] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon&amp;lt;/ref&amp;gt;Recent studies have shown that software schemes are comparable to the hardware solutions. The only cases in which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts, or when there are many writes in the program that are not executed at run-time. For relatively well structured and deterministic programs, on the other hand, software schemes perform in the same range as the hardware schemes.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Schemes – Fetch and Replacements==&lt;br /&gt;
&lt;br /&gt;
===Invalidation Schemes vs. Update Strategies&amp;lt;ref&amp;gt;[https://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf Cache Coherence] Josep Torrellas&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Cache_invalidation Invalidation]: On a write, all other caches with a copy are invalidated. Invalidation is least recommended when there is a single producer and many consumers of data.&lt;br /&gt;
* Update: On a write, all other caches with a copy are updated. The update strategy is least recommended when there are multiple writes by one processing element (say PE1) before the data is read by another processing element (say PE2).&lt;br /&gt;
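&lt;br /&gt;
The two recommendations above can be checked with a small message count for a single shared block. This is a simplified sketch under assumed costs: one bus message per miss and one broadcast per write that reaches other copies. The function name and the example traces are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def coherence_messages(ops, protocol):
    # ops: list of (kind, element) pairs on one block, kind is 'r' or 'w'.
    holders = set()     # elements that currently hold a valid copy
    messages = 0
    for kind, element in ops:
        if element not in holders:
            messages += 1               # miss: the block must be fetched
            holders.add(element)
        if kind == 'w' and holders != {element}:
            messages += 1               # one broadcast to the other copies
            if protocol == 'invalidate':
                holders = {element}     # the other copies are dropped
    return messages

# One producer, three consumers, two rounds: update needs fewer messages.
fanout = [('w', 'P'), ('r', 'C1'), ('r', 'C2'), ('r', 'C3')] * 2
# PE1 writes three times before PE2 reads again: invalidate needs fewer.
burst = [('r', 'PE2'), ('w', 'PE1'), ('w', 'PE1'), ('w', 'PE1'), ('r', 'PE2')]
for trace in (fanout, burst):
    print(coherence_messages(trace, 'invalidate'),
          coherence_messages(trace, 'update'))
```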
&lt;br /&gt;
The strategies followed for fetch and replacement in some of the schemes are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Snoopy Cache Coherence Schemes===&lt;br /&gt;
#Snooping&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt; is the process where the individual caches monitor address lines for accesses to memory locations that they have cached.  When a write is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.&lt;br /&gt;
#Snoopy Scheme is a distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.&lt;br /&gt;
#When replacement of one of the entries is required the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries.&lt;br /&gt;
&lt;br /&gt;
===Directory Based Cache Coherence Schemes===&lt;br /&gt;
#In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.&lt;br /&gt;
#Fetch and Replacement Scenarios&amp;lt;ref&amp;gt;[http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf Cache Coherence Techniques] Silvia Lametti&amp;lt;/ref&amp;gt;:&lt;br /&gt;
##Block in Uncached State: The block required by a processor P1 is not in any cache. When P1 tries to read an uncached block X, a read miss occurs. &lt;br /&gt;
##*Read Miss: P1 is sent the data from memory and becomes the only sharing node. The cached block X is marked as shared.&lt;br /&gt;
##*Write Miss: P1 is sent the data and becomes the sharing node. The block is marked exclusive to indicate that the only valid copy is the cached one.&lt;br /&gt;
##Block is Shared: The block requested by a Processor P1 is in shared state.&lt;br /&gt;
##*Read Miss:  Requesting Processor P1 is sent the data and P1 is added to the set of processors sharing the data.&lt;br /&gt;
##*Write Miss:  Requesting Processor P1 is sent the value. All processors sharing the data are sent an invalidate message. The block is made exclusive for the Processor P1.&lt;br /&gt;
##Block is Exclusive: The current value of the block (say X) is held in the cache of the processor (the owner) P1 which currently owns the block exclusively.&lt;br /&gt;
##*Read Miss: When a processor P2 raises a fetch request for block X, P1 is notified that a fetch request has been posted for a block it currently holds. P1 sends the data back to the directory, where it is written to memory and sent on to the requesting processor P2. P2 is then added to the set of processors sharing block X, along with P1.&lt;br /&gt;
##*Data Write-back: The owner processor P1 is replacing the block and hence must write it back. This brings the copy of block X in memory up to date. The block is now uncached and the set of sharing processors is emptied.&lt;br /&gt;
##*Write Miss: A processor P2 makes a write request to block X, which is currently owned by P1. P1 is notified of this, which causes P1 to update memory with its value of block X. The updated block X is then sent to P2 and made exclusive for P2.&lt;br /&gt;
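&lt;br /&gt;
The scenarios above can be collected into a small state machine for one directory entry. This is a minimal sketch that tracks only the state and the sharer set; data movement is not modelled, and the class and method names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
class DirectoryEntry:
    # Tracks one block: 'uncached', 'shared' or 'exclusive', plus its sharers.
    def __init__(self):
        self.state = 'uncached'
        self.sharers = set()

    def read_miss(self, requester):
        # Returns the owner that must supply the data, or None if memory does.
        owner = None
        if self.state == 'exclusive':
            owner = next(iter(self.sharers))
        self.sharers.add(requester)
        self.state = 'shared'
        return owner

    def write_miss(self, requester):
        # Returns the processors whose copies must be invalidated or fetched.
        losers = self.sharers - {requester}
        self.sharers = {requester}
        self.state = 'exclusive'
        return losers

    def write_back(self, owner):
        # Only the exclusive owner can write the block back and uncache it.
        if self.state == 'exclusive' and self.sharers == {owner}:
            self.sharers = set()
            self.state = 'uncached'

entry = DirectoryEntry()
entry.read_miss('P1')                 # uncached to shared, P1 is the only sharer
invalidated = entry.write_miss('P2')  # P1 loses its copy, P2 owns the block
supplier = entry.read_miss('P1')      # P2 supplies the data, both now share
print(invalidated, supplier, entry.state, sorted(entry.sharers))
```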
&lt;br /&gt;
===Write Through Schemes===&lt;br /&gt;
In write-through schemes, every processor write results in an update of the local cache and a global bus write that updates main memory and also invalidates or updates all other caches holding the same item.&lt;br /&gt;
&lt;br /&gt;
===Write-Back/Ownership Schemes===&lt;br /&gt;
In write-back/ownership schemes, a single cache has ownership of a block. In this case, a write by the owning processor does not result in a bus write, which conserves bandwidth. &lt;br /&gt;
&lt;br /&gt;
===Pointer-Based Coherence Schemes&amp;lt;ref&amp;gt;[http://courses.engr.illinois.edu/cs533/reading_list/1c.pdf Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry&amp;lt;/ref&amp;gt;===&lt;br /&gt;
#The Full Bit Vector Schemes&lt;br /&gt;
#*This scheme associates a complete bit vector, one bit per processor, with each block of main memory. The directory also contains a dirty-bit for each memory block to indicate if some processor has been given exclusive access to modify that block in its cache. Each bit indicates whether that memory block is being cached by the corresponding processor, and thus the directory has full knowledge of the processors caching a given block.&lt;br /&gt;
#*When a block has to be invalidated, messages are sent to all processors whose caches have a copy. In terms of message traffic needed to keep the caches coherent, this is the best that an invalidation-based directory scheme can do.&lt;br /&gt;
#Limited Pointer Schemes&lt;br /&gt;
#*Recent studies have shown that for most kinds of data objects the corresponding memory locations are cached by only a small number of processors at any given time. This knowledge can be exploited to reduce directory memory overhead by restricting each directory entry to a small fixed number of pointers, each pointing to a processor caching that memory block.&lt;br /&gt;
#*An important implication of limited pointer schemes is that there must exist some mechanism to handle blocks that are cached by more processors than the number of pointers in the directory entry.&lt;br /&gt;
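&lt;br /&gt;
The memory saving of limited pointers over a full bit vector can be estimated per memory block. The sketch below assumes that a limited-pointer entry also keeps one dirty bit and that each pointer is just wide enough to name one processor; the function name and the 256-processor example are illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
def directory_bits_per_block(num_processors, num_pointers=None):
    # Full bit vector: one presence bit per processor plus one dirty bit.
    if num_pointers is None:
        return num_processors + 1
    # Limited pointers: each pointer needs enough bits to name a processor.
    pointer_bits = max(1, (num_processors - 1).bit_length())
    return num_pointers * pointer_bits + 1   # assumed: dirty bit is kept

full = directory_bits_per_block(256)
limited = directory_bits_per_block(256, num_pointers=4)
print(full, limited)
```
&lt;br /&gt;
With these assumptions a 256-processor machine needs 257 directory bits per block for the full bit vector but only 33 bits with four pointers, at the price of needing an overflow mechanism when more than four processors cache the block.&lt;br /&gt;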
&lt;br /&gt;
==Cache Coherency protocol==&lt;br /&gt;
A '''coherency protocol''' is a protocol which maintains the consistency between all the caches in a system of [[distributed shared memory]]. The protocol maintains [[memory coherence]] according to a specific [[consistency model]]. Older multiprocessors support the [[sequential consistency]] model, while modern shared memory systems typically support the [[release consistency]] or [[weak consistency]] models.&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
&lt;br /&gt;
Various models and protocols have been devised for maintaining coherence, such as [[MSI_protocol|MSI]], [[MESI_protocol|MESI]] (aka Illinois), [[MOSI_protocol|MOSI]], [[MOESI_protocol|MOESI]], [[MERSI_protocol|MERSI]], [[MESIF_protocol|MESIF]], [[Write-once (cache coherence)|write-once]], [[Synapse protocol|Synapse]], [[Berkeley protocol|Berkeley]], [[Firefly protocol|Firefly]], and [[Dragon protocol]].&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology, lecture slides by Prof. Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof. Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel computer architecture: a hardware/software approach” by David E. Culler, Jaswinder Pal Singh, Anoop Gupta  (pg 887)    &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler.   (ref#2) &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Other References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73600</id>
		<title>CSC/ECE 506 Spring 2013/6a cs</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73600"/>
		<updated>2013-02-24T07:56:57Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Cache Coherence Schemes – Fetch and Replacements */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|300px|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, the processor reads data and instructions from memory and operates on the data.  CPU operating frequencies have increased faster than the speed of memory and memory interconnects.  For example, the cores in Intel's first-generation i7 processors run at 3.2 GHz, while the memory runs at only 1.3 GHz.  Multi-core architectures also place more demand on memory bandwidth.  Both effects increase memory access latency relative to the core clock and leave the CPU idle while it waits for data, so memory has become a performance bottleneck.   &lt;br /&gt;
&lt;br /&gt;
To solve this problem, the “cache” was invented.  A cache is temporary, volatile storage like primary memory, but it runs at a speed close to the core frequency.  The CPU can access data and instructions in the cache within a few clock cycles, while accessing data in main memory can take more than 50 cycles.  In the early days of computing, the cache was implemented as a standalone chip outside the processor.  In today’s processors, the cache is implemented on the same die as the core.  &lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of cache, each level farther from the core and larger than the one before it.  L1 is closest to the CPU and, as a result, the fastest to access.  Next to L1 is the L2 cache, and then L3.  The L1 cache is divided into an instruction cache and a data cache.  This is preferable to a single larger combined cache because the instruction cache, being read-only, is easier to implement, while the data cache must support both reads and writes.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs so that memory access can keep pace with CPU speeds, which are advancing more rapidly.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16-64KB and provides access in 2-4 cycles.  L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles.  L3 typically ranges from 4MB to 32MB and provides access to the data it contains within 30-50 cycles.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management by requiring it to consider all processes running concurrently when performing cache coherency, replenishment, fetching, and other management functions.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors while others are private to each processor.  Typically, lower-level caches (such as L1) are private, and higher-level caches (such as L2 and L3) may or may not be shared.   Notice that none of the examples below has a shared L1 cache.  Cache management functions must consider both shared and private caches when reading and writing data from memory.&lt;br /&gt;
[[Image:Uniprocessor_memory_hierarchy.jpg|thumbnail|200px|Uniprocessor memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[Image:Dualcore_memory_hierarchy.jpg|thumbnail|200px|Dualcore memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;|Table 1: Cache on different Microprocessors&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache per core !! L2 cache !! L3 cache  !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB     D:64KB  Both 2-way set assoc. || 1MB 16way set assoc. || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64X2 || 2 || I:64KB     D:64KB  Both 2-way set assoc. || 512KB/1MB 16way set assoc. || 2MB || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Barcelona || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Aug 2007&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems Ultra Sparc T2 || 8 ||  I:16KB   D:8KB || 4MB (8 banks) 16way set assoc. || - || Oct 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Wolfdale DP || 2 ||  D:96KB  || 6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Harpertown || 4 ||  D:96KB  || 2*6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Phenom || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Nov 2007    Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo || 2 ||  I:32KB   D:32KB  || 2/4MB 8 way set assoc. || - ||  2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Penryn Wolfdale DP || 4 ||  -  || 6-12MB || 6MB || Mar 2008     Aug 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Quad Yorkfield || 4 ||  D:96KB  || 12MB  || - ||  Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| AMD Toliman || 3K10 || I:64KB   D:64KB  || 512KB || 2MB Shared || Apr 2008&lt;br /&gt;
|-&lt;br /&gt;
| Azul Systems Vega3 7300 Series || 864 || 768GB  || - || - || May 2008&lt;br /&gt;
|-&lt;br /&gt;
| IBM RoadRunner || 8+1 || 32KB  || 512KB || - || Jun 2008&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
In section 6.2.3&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory. As a review, write-through writes data to both the cache and memory on every write. Write-back writes to the cache first and to memory only when a flush is required.&lt;br /&gt;
The write miss policies covered in the text&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
[[Image:Cache_policy_comparison.jpg|thumbnail|600px|Cache policy comparison&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Easy to implement&lt;br /&gt;
***Main memory has most recent copy of the data&lt;br /&gt;
***Read misses never result in writes to main memory&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Every write needs to access main memory&lt;br /&gt;
***Bandwidth intensive&lt;br /&gt;
***Writes are slower&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Writes are as fast as the speed of the cache memory&lt;br /&gt;
***Multiple writes to a block require one write to main memory&lt;br /&gt;
***Less bandwidth intensive&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Harder to implement&lt;br /&gt;
***Main memory may not be consistent with cache&lt;br /&gt;
***Reads that result in data replacement may cause dirty blocks to be written to main memory&lt;br /&gt;
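&lt;br /&gt;
The bandwidth difference between the two write hit policies can be seen in a toy Python model. This is only a sketch under strong assumptions: the cache holds a single line, a block is always allocated on a miss, and the function name memory_writes and the trace format are hypothetical.&lt;br /&gt;

```python
def memory_writes(policy, accesses):
    """Count main-memory writes for a single-line cache (toy model).

    accesses is a list of (op, block) pairs, op being "r" or "w".
    """
    cached, dirty, writes = None, False, 0
    for op, block in accesses:
        if block != cached:
            # Replacement: a write-back cache flushes a dirty block first.
            if policy == "write-back" and dirty:
                writes += 1
            cached, dirty = block, False
        if op == "w":
            if policy == "write-through":
                writes += 1          # every write goes to memory
            else:
                dirty = True         # defer until the block is replaced
    return writes


# Three writes to block 0, then a read of block 1 that evicts block 0.
trace = [("w", 0), ("w", 0), ("w", 0), ("r", 1)]
wt = memory_writes("write-through", trace)
wb = memory_writes("write-back", trace)
```

In this trace write-through performs one memory write per store, while write-back performs a single memory write, and that write is triggered by the read that replaces the dirty block.&lt;br /&gt;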
&lt;br /&gt;
==Write miss policies==&lt;br /&gt;
Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus. &lt;br /&gt;
&lt;br /&gt;
*Write-allocate vs Write no-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block.  For write-allocate, there will be a line in the cache for the written data.  This policy is typically associated with write-back caches.  For no-write-allocate, there will not be a line in the cache.&lt;br /&gt;
*Fetch-on-write vs no-fetch-on-write: Fetch-on-write causes the block of data to be fetched from a lower level of the memory hierarchy whenever a write misses. Fetch-on-write determines whether the block is fetched from memory, and write-allocate determines whether the memory block is stored in a cache line. Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.&lt;br /&gt;
*Write-before-hit vs no-write-before-hit:  The write-before-hit will write data to the cache before checking the cache tags for a match.  In case of a miss, the policy will displace the block of data already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram. They are considered unproductive since the data is fetched from lower level memory, but is not stored in the cache. Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low.&lt;br /&gt;
&lt;br /&gt;
* Write-validate: It is a combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It performs better, works with machines that have various line sizes, and does not add instruction execution overhead to the program being run.  Write-validate requires that the lower level system memory can support writes of partial lines. No-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the processor is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-around:  Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme to write to cache.  It writes data to the next lower cache without modifying the data of the cache line. It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that in all but a few cases write-around performs worse than write-validate. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from lower-level memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst, with the smallest reduction in write misses. The author notes, though, that it does perform better than fetch-on-write and is easy to implement.&lt;br /&gt;
&lt;br /&gt;
* Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the base-line for determining the performance of the other three policy combinations.   In general, all three previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
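&lt;br /&gt;
The combinations named in this section can be summarized as a small lookup. The Python sketch below follows only the definitions given in the text; the function name write_miss_policy and the label &amp;quot;unproductive&amp;quot; for the blocked-out combinations are illustrative choices.&lt;br /&gt;

```python
def write_miss_policy(fetch_on_write, write_allocate, write_before_hit):
    """Name the write-miss combination described in the text."""
    if fetch_on_write:
        # Fetching a block without allocating a line for it is the
        # combination the text calls unproductive.
        return "fetch-on-write" if write_allocate else "unproductive"
    if write_allocate:
        return "write-validate"
    return "write-invalidate" if write_before_hit else "write-around"


names = [
    write_miss_policy(False, True, False),
    write_miss_policy(False, False, True),
    write_miss_policy(False, False, False),
    write_miss_policy(True, True, False),
]
```

Here names lists write-validate, write-invalidate, write-around, and fetch-on-write, in that order.&lt;br /&gt;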
&lt;br /&gt;
=Prefetching=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. A cache is very efficient in terms of access time once the data or instructions are in the cache.  But when a process tries to access something that is not already in the cache, a cache miss occurs and the missing data must be brought into the cache from memory.  Cache misses are generally expensive because the processor has to wait for the data (in parallel processing, a process can execute other tasks while it is waiting on data, but this incurs some overhead).  Prefetching brings data into the cache before the program needs it.  In other words, it is a way to reduce cache misses.  Prefetching uses some type of prediction mechanism to anticipate the data that will be needed next and brings it into the cache.  It is not guaranteed that the prefetched data will be used.  The goal is to reduce cache misses and so improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into the cache.  Programmers and compilers can insert these prefetch instructions into the code.  This is known as software prefetching.  In hardware prefetching, the processor observes the system's behavior and issues prefetch requests itself.  The Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch.   Graphics Processing Units benefit from prefetching due to the spatial locality of their data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.   &lt;br /&gt;
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires a more complex architecture.  A second-order effect is the cost of implementing it in silicon and validating it.&lt;br /&gt;
* Software prefetching adds additional instructions to the program, making the program larger.&lt;br /&gt;
* If the same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted.  If the evicted blocks are needed later, they will generate cache misses.  This can be prevented by having a separate cache for prefetching, but that comes with hardware costs.&lt;br /&gt;
* When the scheduler changes the task running on a processor, prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked by the following metrics:&lt;br /&gt;
# Coverage is defined as the fraction of original cache misses that were prefetched, resulting in a cache hit.&lt;br /&gt;
# Accuracy is defined as the fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetches arrive.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy and optimum timeliness.  Realistically, aggressive prefetching can increase coverage but decrease accuracy, and vice versa.  Also, if prefetching is done too early, the fetched data may have to be evicted without being used, and if it is done too late, it can still cause a cache miss.&lt;br /&gt;
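&lt;br /&gt;
Coverage and accuracy as defined above are simple ratios. The Python sketch below computes them from event counts; the function name prefetch_metrics and the counts are hypothetical and chosen only for illustration, and timeliness is not modeled.&lt;br /&gt;

```python
def prefetch_metrics(original_misses, useful_prefetches, total_prefetches):
    """Return (coverage, accuracy) as fractions between 0 and 1."""
    coverage = useful_prefetches / original_misses
    accuracy = useful_prefetches / total_prefetches
    return coverage, accuracy


# Hypothetical counts: 1000 misses without prefetching, 600 of them
# turned into hits by prefetches, 800 prefetches issued in total.
coverage, accuracy = prefetch_metrics(1000, 600, 800)
```

For these counts the coverage is 0.6 and the accuracy is 0.75: issuing more prefetches could raise the first number while lowering the second.&lt;br /&gt;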
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This prefetching technique uses a FIFO buffer in which each entry is a cache line with an address (or tag) and an available bit.  The system prefetches a stream of sequential data into a stream buffer, and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, the head entries of the stream buffers are checked for a match along with the cache lookup.  If the block is not found in the cache but is found at the head of a stream buffer, it is moved to the cache and the next entry in the buffer becomes the head.  If the block is found neither in the cache nor at the head of a buffer, the data is brought from memory into the cache and the subsequent address is assigned to a new stream buffer.   Only the heads of the stream buffers are checked during a cache access, not the whole buffer, because checking every entry in every buffer would increase hardware complexity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Cache_hit_improvements.jpg|center]]&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
The plot above shows the cache hit improvements with respect to the number of stream buffers for different programs.  The graph on the left compares 8 programs from the NAS suite, while the graph on the right shows programs from the Unix unity suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
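&lt;br /&gt;
The head-only lookup of stream buffers can be sketched in Python as follows. This is a simplified model and not the design evaluated in the cited study: the cache is an unbounded set, prefetched blocks are assumed to arrive instantly, a new stream replaces buffers in round-robin order, and the depth DEPTH and all names are hypothetical.&lt;br /&gt;

```python
from collections import deque

DEPTH = 4  # hypothetical number of entries per stream buffer


class StreamBuffers:
    def __init__(self, num_buffers):
        self.cache = set()                 # block addresses held in the cache
        self.buffers = [deque() for _ in range(num_buffers)]
        self.next_victim = 0               # round-robin buffer reallocation
        self.memory_fetches = 0            # demand fetches only

    def access(self, block):
        if block in self.cache:
            return "cache hit"
        for buf in self.buffers:
            # Only the head entry of each buffer is compared.
            if buf and buf[0] == block:
                buf.popleft()              # next entry becomes the head
                buf.append(block + DEPTH)  # keep the stream running ahead
                self.cache.add(block)
                return "stream buffer hit"
        # Miss in both: demand-fetch the block and start a new stream
        # at the addresses that follow it.
        self.memory_fetches += 1
        self.cache.add(block)
        victim = self.buffers[self.next_victim]
        victim.clear()
        victim.extend(range(block + 1, block + 1 + DEPTH))
        self.next_victim = (self.next_victim + 1) % len(self.buffers)
        return "miss"


sb = StreamBuffers(num_buffers=2)
results = [sb.access(b) for b in range(100, 108)]
```

For this purely sequential trace only the first access misses; each later block is found at the head of the stream buffer that was allocated on that first miss.&lt;br /&gt;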
&lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching clearly helps performance.  On a multiprocessor system prefetching is still useful, but it is subject to tighter implementation constraints because data is shared between different processors or cores.  In the message-passing parallel model, each parallel thread has its own memory space, and prefetching can be implemented in the same way as on a uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared memory parallel programming, multiple threads that run on different processors share a common memory space.  If multiple processors use a common cache, then the prefetching implementation is similar to that of a uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but has not yet used it.  Meanwhile, processor P2 reads D1 into its cache and modifies it, and a cache coherency protocol is used to inform P1 of the change.  D1 is not in P1’s cache, so P1 may simply ignore the notification.  When P1 then tries to read D1, it gets the stale data from its stream buffer.  One way to prevent this is to enhance the stream buffers so that they can modify their data in response to coherence messages, just like a cache.  This adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
# The prefetching mechanism may fetch data D1, D2, D3, …, D10 into P1’s buffer.  Because of parallel processing, P1 only needs to operate on D1 to D5, while P2 operates on the remaining data.  The bandwidth spent fetching D6 to D10 into P1 is wasted, since P1 never uses them.  There is a trade-off here: if prefetching is too conservative it leads to misses, and if it is too aggressive it wastes bandwidth.&lt;br /&gt;
# In a multiprocessor system, if threads are not bound to a core, the operating system can rebalance threads across cores.  This requires the prefetched buffers to be discarded.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
==Prefetching in Intel Core i7&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track recent accesses to memory:  one for upstreams that has 12 entries, and one for downstreams that has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
&lt;br /&gt;
==Prefetching in AMD&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
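&lt;br /&gt;
The behavior described above, a burst of prefetches when a stream is first detected followed by one block per later sequential access, can be approximated in Python. This sketch is not AMD's actual logic: it watches a demand stream of block numbers, treats two consecutive block numbers as a detected stream, and uses hypothetical names throughout.&lt;br /&gt;

```python
def unit_stride_prefetches(demand_blocks, preset=2):
    """Return the blocks prefetched for a sequence of demanded blocks."""
    issued = []
    last = None
    ahead = None        # highest block prefetched for the current stream
    for block in demand_blocks:
        if last is not None and block == last + 1:
            if ahead is None:
                # Stream just detected: issue `preset` prefetches at once.
                new = list(range(block + 1, block + 1 + preset))
            else:
                # Stream continues: one more block keeps the same lead.
                new = [ahead + 1]
            issued.extend(new)
            ahead = new[-1]
        else:
            ahead = None    # not sequential: forget the stream
        last = block
    return issued


issued = unit_stride_prefetches([10, 11, 12, 13], preset=2)
```

With a preset of two, the access to block 11 triggers prefetches of blocks 12 and 13, and each following sequential access adds one block, so the prefetcher stays two blocks ahead of the demand stream.&lt;br /&gt;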
&lt;br /&gt;
AMD contends that this is beneficial when processing large data sets which often process sequential data and process all the data in the stream.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing. &lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to ensure that the prefetcher fetches at a rate sufficient to continuously provide data to the processor.&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Support=&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt;Cache coherence (also cache coherency) refers to the consistency of data stored in local caches of a shared resource. In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.&lt;br /&gt;
&lt;br /&gt;
Cache coherence is achieved if the following conditions are met.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and  reads from the same memory location X, then P1 should be returned value A provided no other write happened in-between.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and another processor P2 reads from the same memory location X, then P2 should be returned the value A written by P1 provided no other write happened in-between.  &lt;br /&gt;
* Consider that a processor P1 writes a value A to a memory location X followed by another processor P2 which writes a value B to the same memory location X. In this case the writes should appear in the same order i.e. A and then B.&lt;br /&gt;
&lt;br /&gt;
==Software vs. Hardware solutions==&lt;br /&gt;
&lt;br /&gt;
Both software and hardware solutions exist for cache coherence. Hardware solutions are more widely used than software solutions. Software schemes require support from the cache or the runtime system, and in some cases also require hardware assistance. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;ref&amp;gt;[ftp://ftp.cs.wisc.edu/markhill/Papers/isca91_coherence.pdf Comparison of Hardware and Software Cache Coherence Schemes] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon&amp;lt;/ref&amp;gt;Recent studies have shown that software schemes are comparable to hardware solutions. The only cases for which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts or when there are many writes in the program that are not executed at run-time. For relatively well structured and deterministic programs, on the other hand, software schemes perform in the same range as hardware schemes.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Schemes – Fetch and Replacements==&lt;br /&gt;
&lt;br /&gt;
===Invalidation Schemes vs. Update Strategies&amp;lt;ref&amp;gt;[https://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf Cache Coherence] Josep Torrellas&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Cache_invalidation Invalidation]: On a write, all other caches with a copy are invalidated. Invalidation is least suitable when there is a single producer and many consumers of the data.&lt;br /&gt;
* Update: On a write, all other caches with a copy are updated. Update is least suitable when one processing element, say PE1, performs multiple writes before the data is read by another processing element, PE2.&lt;br /&gt;
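&lt;br /&gt;
The two cases named above can be compared with a toy Python model that counts bus transactions for a single shared block. The model is only a sketch: every PE is assumed to start with a valid copy, each invalidation or update is counted as one broadcast, each miss as one refetch, and the function name coherence_messages and the traces are hypothetical.&lt;br /&gt;

```python
def coherence_messages(protocol, trace, num_pes):
    """Count bus transactions for one shared block (toy model).

    trace is a list of (op, pe) pairs; every PE starts with a valid copy.
    """
    valid = [True] * num_pes
    messages = 0
    for op, pe in trace:
        if not valid[pe]:
            messages += 1                # miss: refetch the block
            valid[pe] = True
        if op == "w":
            others = [p for p in range(num_pes) if p != pe and valid[p]]
            if others:
                messages += 1            # one broadcast per write
                if protocol == "invalidate":
                    for p in others:
                        valid[p] = False
    return messages


# Single producer (PE0) and three consumers, repeated four times.
producer_consumer = [("w", 0), ("r", 1), ("r", 2), ("r", 3)] * 4
# PE0 writes four times before PE1 reads once.
burst = [("w", 0)] * 4 + [("r", 1)]

inv_pc = coherence_messages("invalidate", producer_consumer, 4)
upd_pc = coherence_messages("update", producer_consumer, 4)
inv_burst = coherence_messages("invalidate", burst, 2)
upd_burst = coherence_messages("update", burst, 2)
```

Under these assumptions the producer-consumer trace costs 16 transactions with invalidation but 4 with update, while the write burst costs 2 with invalidation but 4 with update, matching the two recommendations above.&lt;br /&gt;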
&lt;br /&gt;
The strategies followed for fetch and replacement in some of the schemes are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Snoopy Cache Coherence Schemes===&lt;br /&gt;
#Snooping&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt; is the process where the individual caches monitor address lines for accesses to memory locations that they have cached.  When a write is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.&lt;br /&gt;
#Snoopy Scheme is a distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.&lt;br /&gt;
#When replacement of one of the entries is required the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries.&lt;br /&gt;
&lt;br /&gt;
===Directory Based Cache Coherence Schemes===&lt;br /&gt;
#In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.&lt;br /&gt;
#Fetch and Replacement Scenarios&amp;lt;ref&amp;gt;[http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf Cache Coherence Techniques] Silvia Lametti&amp;lt;/ref&amp;gt;:&lt;br /&gt;
##Block in Uncached State: The block required by processor P1 is not in the cache. When P1 tries to read an uncached block X, a read miss occurs.&lt;br /&gt;
##*Read Miss: P1 is sent the data from memory and is made the only sharing node. The cached block X is marked as shared.&lt;br /&gt;
##*Write Miss: P1 is sent the data and is made the only sharing node. The block is marked exclusive to indicate that the only valid copy is cached.&lt;br /&gt;
##Block is Shared: The block requested by processor P1 is in the shared state.&lt;br /&gt;
##*Read Miss: The requesting processor P1 is sent the data and is added to the set of processors sharing the data.&lt;br /&gt;
##*Write Miss: The requesting processor P1 is sent the value. All processors sharing the data are sent an invalidate message. The block is made exclusive to P1.&lt;br /&gt;
##Block is Exclusive: The current value of the block (say X) is held in the cache of processor P1 (the owner), which currently owns the block exclusively.&lt;br /&gt;
##*Read Miss: When a processor P2 issues a fetch request for block X, P1 is notified that a fetch request has been posted for a block it currently holds. P1 sends the data back to the directory, where it is written to memory and sent on to the requesting processor P2. P2 is then added to the set of processors sharing block X, along with P1.&lt;br /&gt;
##*Data Write-back: The owner processor P1 is replacing the block and hence must write it back. This causes the copy of block X in memory to be up to date. The block is now uncached and the set of sharing processors is emptied.&lt;br /&gt;
##*Write Miss: A processor P2 makes a write request to block X, which is currently owned by P1. P1 is notified of this, which causes it to update memory with its value of block X. The updated block X is then sent to P2 and made exclusive to P2.&lt;br /&gt;
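&lt;br /&gt;
The fetch and replacement scenarios above can be sketched as a small state machine for a single memory block. This is a minimal illustration under stated assumptions, not the implementation from the cited slides: the class name Directory, its method names, and the message labels are hypothetical, and data movement is reduced to a log of coherence messages.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of a directory entry for one memory block.
# State names follow the text; class, method and message names are assumptions.

class Directory:
    def __init__(self):
        self.state = "uncached"   # "uncached", "shared", or "exclusive"
        self.sharers = set()      # processors currently holding a copy
        self.messages = []        # log of coherence messages sent

    def read_miss(self, p):
        if self.state == "exclusive":
            # The owner supplies the data, memory is updated,
            # and the owner stays in the sharing set.
            owner = next(iter(self.sharers))
            self.messages.append(("fetch", owner))
        self.sharers.add(p)
        self.state = "shared"

    def write_miss(self, p):
        if self.state == "shared":
            # Every other sharer receives an invalidate message.
            for q in sorted(self.sharers - {p}):
                self.messages.append(("invalidate", q))
        elif self.state == "exclusive":
            # The old owner writes its value back and gives up the block.
            owner = next(iter(self.sharers))
            self.messages.append(("fetch_invalidate", owner))
        self.sharers = {p}
        self.state = "exclusive"

    def write_back(self, p):
        # The owner replaces the block: memory is up to date, no sharers remain.
        assert self.state == "exclusive" and p in self.sharers
        self.sharers = set()
        self.state = "uncached"


d = Directory()
d.read_miss("P1")    # uncached to shared, P1 is the only sharer
d.read_miss("P2")    # still shared, P1 and P2 share the block
d.write_miss("P1")   # P2 is invalidated, P1 becomes the exclusive owner
d.read_miss("P2")    # P1 supplies the block, both share it again
print(d.state, sorted(d.sharers), d.messages)
```
&lt;br /&gt;
Keeping the sharing set in the directory entry is what lets a write miss send invalidations only to the processors that actually hold a copy.&lt;br /&gt;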
&lt;br /&gt;
===Write Through Schemes===&lt;br /&gt;
In write-through schemes, every processor write results in an update of the local cache and a global bus write that updates main memory and also invalidates or updates all other caches holding the same item.&lt;br /&gt;
&lt;br /&gt;
===Write-Back/Ownership Schemes===&lt;br /&gt;
In write-back/ownership schemes, a single cache has ownership of a block. When the owning processor writes to the block, the write does not result in a bus write, which conserves bandwidth.&lt;br /&gt;
&lt;br /&gt;
===Pointer-Based Coherence Schemes&amp;lt;ref&amp;gt;[http://courses.engr.illinois.edu/cs533/reading_list/1c.pdf Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry&amp;lt;/ref&amp;gt;===&lt;br /&gt;
#The Full Bit Vector Schemes&lt;br /&gt;
#*This scheme associates a complete bit vector, one bit per processor, with each block of main memory. Each bit indicates whether that memory block is being cached by the corresponding processor, and thus the directory has full knowledge of the processors caching a given block. The directory also contains a dirty bit for each memory block to indicate whether some processor has been given exclusive access to modify that block in its cache.&lt;br /&gt;
#*When a block has to be invalidated, messages are sent to all processors whose caches have a copy. In terms of message traffic needed to keep the caches coherent, this is the best that an invalidation-based directory scheme can do.&lt;br /&gt;
#Limited Pointer Schemes&lt;br /&gt;
#*Studies have shown that for most kinds of data objects the corresponding memory locations are cached by only a small number of processors at any given time. This knowledge can be exploited to reduce directory memory overhead by restricting each directory entry to a small fixed number of pointers, each pointing to a processor caching that memory block.&lt;br /&gt;
#*An important implication of limited pointer schemes is that there must exist some mechanism to handle blocks that are cached by more processors than the number of pointers in the directory entry.&lt;br /&gt;
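&lt;br /&gt;
The directory memory overhead of the two schemes can be compared with simple arithmetic. The sketch below is an illustration under stated assumptions: the function names are hypothetical, each entry is assumed to carry one dirty bit, and each pointer is assumed to be just wide enough to name any one processor.&lt;br /&gt;
&lt;br /&gt;
```python
# Directory bits needed per memory block (illustrative assumptions only).

def full_bit_vector_bits(num_processors):
    # One presence bit per processor plus one dirty bit.
    return num_processors + 1


def limited_pointer_bits(num_processors, num_pointers):
    # Width of one pointer: the number of bits needed to name any processor.
    pointer_width = (num_processors - 1).bit_length()
    return num_pointers * pointer_width + 1


for p in (16, 64, 256, 1024):
    print(p, full_bit_vector_bits(p), limited_pointer_bits(p, 4))
```
&lt;br /&gt;
With four pointers per entry, the two schemes cost about the same at 16 processors, but the full bit vector grows linearly with the processor count while the limited pointer entry grows only logarithmically.&lt;br /&gt;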
&lt;br /&gt;
==Cache Coherency protocol==&lt;br /&gt;
A '''coherency protocol''' is a protocol which maintains the consistency between all the caches in a system of [[distributed shared memory]]. The protocol maintains [[memory coherence]] according to a specific [[consistency model]]. Older multiprocessors support the [[sequential consistency]] model, while modern shared memory systems typically support the [[release consistency]] or [[weak consistency]] models.&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
&lt;br /&gt;
Various models and protocols have been devised for maintaining coherence, such as [[MSI_protocol|MSI]], [[MESI_protocol|MESI]] (aka Illinois), [[MOSI_protocol|MOSI]], [[MOESI_protocol|MOESI]], [[MERSI_protocol|MERSI]], [[MESIF_protocol|MESIF]], [[Write-once (cache coherence)|write-once]], [[Synapse protocol|Synapse]], [[Berkeley protocol|Berkeley]], [[Firefly protocol|Firefly]], and [[Dragon protocol|Dragon]].&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology, lecture slides by Prof. Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof. Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel computer architecture: a hardware/software approach” by David E. Culler, Jaswinder Pal Singh, Anoop Gupta (p. 887) &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler.   (ref#2) &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Other References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73599</id>
		<title>CSC/ECE 506 Spring 2013/6a cs</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73599"/>
		<updated>2013-02-24T07:51:11Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Write Through Schemes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|300px|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, the processor reads data and instructions from memory and operates on the data.  CPU operating frequencies have increased faster than the speed of memory and memory interconnects.  For example, cores in Intel first-generation i7 processors run at 3.2 GHz, while the memory runs at only 1.3 GHz.  Multi-core architectures have also put more demand on memory bandwidth.  Both trends increase memory access latency relative to the CPU, which leaves the CPU idle for most of the time.  As a result, memory has become a performance bottleneck.&lt;br /&gt;
&lt;br /&gt;
To solve this problem, the cache was invented.  A cache is a temporary, volatile storage space like primary memory, but it runs at a speed close to the core frequency.  The CPU can access data and instructions from the cache in a few clock cycles, while accessing data from main memory can take more than 50 cycles.  In the early days of computing, the cache was implemented as a stand-alone chip outside the processor.  In today’s processors, the cache is implemented on the same die as the core.&lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of cache, each successively farther from the core and larger in size.  L1 is closest to the CPU and, as a result, the fastest to access.  Next to L1 is the L2 cache, and then L3.  The L1 cache is divided into an instruction cache and a data cache.  This is better than having a larger combined cache because the instruction cache, being read-only, is easy to implement, while the data cache is read-write.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs so that memory access can keep pace with CPU speeds, which are advancing more rapidly.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16-64KB and provides access in 2-4 cycles.  L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles.  L3 typically ranges from 4MB to 32MB and provides access to the data it contains within 30-50 cycles.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management: cache coherency, replenishment, fetching, and other management functions must account for all processes running concurrently.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors while others are private to each processor.  Typically, the caches closest to the core are private, and caches farther from the core may or may not be shared.  Notice that none of the examples below has a shared L1 cache.  Cache management functions must consider both shared and private caches when reading data from and writing data to memory.&lt;br /&gt;
[[Image:Uniprocessor_memory_hierarchy.jpg|thumbnail|200px|Uniprocessor memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[Image:Dualcore_memory_hierarchy.jpg|thumbnail|200px|Dualcore memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;|Table 1: Cache on different Microprocessors&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache Per core !! L2 cache !! L3 cache  !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 1MB 16way set assoc. || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64X2 || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 512KB/1MB 16way set assoc. || 2MB || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Barcelona || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Aug 2007&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems Ultra Sparc T2 || 8 ||  I:16KB   D:8KB || 4MB (8 banks) 16way set assoc. || - || Oct 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Wolfdale DP || 2 ||  D:96KB  || 6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Harpertown || 4 ||  D:96KB  || 2*6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Phenom || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Nov 2007    Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo || 2 ||  I:32KB   D:32KB  || 2/4MB 8 way set assoc. || - ||  2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Penryn Wolfdale DP || 4 ||  -  || 6-12MB || 6MB || Mar 2008     Aug 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Quad Yorkfield || 4 ||  D:96KB  || 12MB  || - ||  Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| AMD Toliman || 3K10 || I:64KB   D:64KB  || 512KB || 2MB Shared || Apr 2008&lt;br /&gt;
|-&lt;br /&gt;
| Azul Systems Vega3 7300 Series || 864 || 768GB  || - || - || May 2008&lt;br /&gt;
|-&lt;br /&gt;
| IBM RoadRunner || 8+1 || 32KB  || 512KB || - || Jun 2008&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
In section 6.2.3&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory. As a review, write-through writes data to both the cache and memory on a write. Write-back writes to the cache first and to memory only when a flush is required.&lt;br /&gt;
The write miss policies covered in the text&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
[[Image:Cache_policy_comparison.jpg|thumbnail|600px|Cache policy comparison&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Easy to implement&lt;br /&gt;
***Main memory has most recent copy of the data&lt;br /&gt;
***Read misses never result in writes to main memory&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Every write needs to access main memory&lt;br /&gt;
***Bandwidth intensive&lt;br /&gt;
***Writes are slower&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Writes are as fast as the speed of the cache memory&lt;br /&gt;
***Multiple writes to a block require one write to main memory&lt;br /&gt;
***Less bandwidth intensive&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Harder to implement&lt;br /&gt;
***Main memory may not be consistent with cache&lt;br /&gt;
***Reads that result in data replacement may cause dirty blocks to be written to main memory&lt;br /&gt;
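&lt;br /&gt;
The bandwidth difference between the two write hit policies can be illustrated by counting main-memory writes for a trace of written block numbers. This is a minimal sketch under stated assumptions rather than a model of any particular processor: the function name memory_writes is hypothetical, the cache is assumed to be direct-mapped with write-allocate, and remaining dirty lines are assumed to be flushed at the end of the trace.&lt;br /&gt;
&lt;br /&gt;
```python
# Count main-memory writes for a sequence of writes to block numbers.
# Assumptions: direct-mapped cache, write-allocate, final flush of dirty lines.

def memory_writes(trace, num_lines, policy):
    lines = {}        # line index mapped to (block number, dirty flag)
    writes = 0
    for block in trace:
        index = block % num_lines
        resident = lines.get(index)
        if resident is not None and resident[0] != block and resident[1]:
            writes += 1                      # dirty victim is written back
        if policy == "write-through":
            writes += 1                      # every write also goes to memory
            lines[index] = (block, False)
        else:
            lines[index] = (block, True)     # write-back only marks the line dirty
    if policy == "write-back":
        writes += sum(1 for _, dirty in lines.values() if dirty)
    return writes


trace = [0, 0, 0, 1, 1, 4, 0]
print(memory_writes(trace, 4, "write-through"), memory_writes(trace, 4, "write-back"))
```
&lt;br /&gt;
For this trace, write-through issues one memory write per store, while write-back only writes a block when it is replaced or flushed, which matches the advantage that multiple writes to a block require one write to main memory.&lt;br /&gt;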
&lt;br /&gt;
==Write miss policies==&lt;br /&gt;
Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus. &lt;br /&gt;
&lt;br /&gt;
*Write-allocate vs Write no-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block.  For write-allocate, there will be a line in the cache for the written data.  This policy is typically associated with write-back caches.  For no-write-allocate, there will not be a line in the cache.&lt;br /&gt;
*Fetch-on-write vs no-fetch-on-write: Fetch-on-write causes the block of data to be fetched from a lower level of the memory hierarchy on every write miss. Fetch-on-write determines whether the block is fetched from memory, and write-allocate determines whether the memory block is stored in a cache line. Certain write policies may allocate a line in the cache for data written by the CPU without retrieving it from memory first.&lt;br /&gt;
*Write-before-hit vs no-write-before-hit:  The write-before-hit will write data to the cache before checking the cache tags for a match.  In case of a miss, the policy will displace the block of data already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram. They are considered unproductive since the data is fetched from lower-level memory but is not stored in the cache. Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low.&lt;br /&gt;
&lt;br /&gt;
* Write-validate: It is a combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run.  Write-validate requires that the lower level system memory can support writes of partial lines. The no-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the process is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-around:  Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme to write to cache.  It writes data to the next lower cache without modifying the data of the cache line. It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that in all but a few cases write-around performs worse than write-validate. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from lower-level memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst, with the smallest reduction in write misses. The author notes, though, that it does perform better than fetch-on-write and is easy to implement.&lt;br /&gt;
&lt;br /&gt;
* Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the base-line for determining the performance of the other three policy combinations.   In general, all three previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
&lt;br /&gt;
=Prefetching=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. A cache is very efficient in terms of access time once the data or instructions are in the cache.  But when a process tries to access something that is not already in the cache, a cache miss occurs and the missing blocks must be brought into the cache from memory.  Cache misses are generally expensive because the processor has to wait for the data (in parallel processing, a process can execute other tasks while it waits for data, but this incurs some overhead).  Prefetching brings data into the cache before the program needs it; in other words, it is a way to reduce cache misses.  Prefetching uses some type of prediction mechanism to anticipate the data that will be needed next and brings it into the cache.  It is not guaranteed that the prefetched data will be used.  The goal is to reduce cache misses and thereby improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into the cache.  Programmers and compilers can insert these prefetch instructions into the code.  This is known as software prefetching.  In hardware prefetching, the processor observes system behavior and issues prefetch requests itself.  The Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch.  Graphics Processing Units benefit from prefetching due to the spatial locality of their data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires a complex architecture.  A second-order effect is the cost of implementing it in silicon and validating it.&lt;br /&gt;
* Software prefetching adds instructions to the program, making the program larger.&lt;br /&gt;
* If the same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted.  If the evicted blocks are needed later, that generates a cache miss.  This can be prevented by having a separate cache for prefetching, but that comes with hardware costs.&lt;br /&gt;
* When the scheduler changes the task running on a processor, prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked with the following metrics:&lt;br /&gt;
# Coverage is the fraction of original cache misses that were prefetched and therefore became cache hits.&lt;br /&gt;
# Accuracy is the fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetches arrive.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy, and optimum timeliness.  Realistically, aggressive prefetching can increase coverage but decrease accuracy, and vice versa.  Also, if prefetching is done too early, the fetched data may be evicted without being used, and if it is done too late, a cache miss can still occur.&lt;br /&gt;
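&lt;br /&gt;
Coverage and accuracy are simple ratios, as the short sketch below illustrates. The function names and all of the counts are hypothetical and chosen only for illustration; they are not measurements from any cited source.&lt;br /&gt;
&lt;br /&gt;
```python
# Coverage and accuracy as defined above, with made-up example counts.

def coverage(misses_eliminated, original_misses):
    # Fraction of the original misses that prefetching turned into hits.
    return misses_eliminated / original_misses


def accuracy(useful_prefetches, total_prefetches):
    # Fraction of issued prefetches whose data was actually used.
    return useful_prefetches / total_prefetches


original_misses = 1000     # misses without prefetching (hypothetical)
total_prefetches = 800     # prefetches issued (hypothetical)
useful_prefetches = 600    # prefetches that removed a miss (hypothetical)

print(coverage(useful_prefetches, original_misses))
print(accuracy(useful_prefetches, total_prefetches))
```
&lt;br /&gt;
In this example, issuing more prefetches would raise coverage only if the extra prefetches are useful; otherwise it lowers accuracy and wastes bandwidth.&lt;br /&gt;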
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This prefetching technique uses a FIFO buffer in which each entry is a cache line with an address (or tag) and an available bit.  The system prefetches a stream of sequential data into a stream buffer, and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, the head entries of the stream buffers are checked for a match along with the cache.  If the block is not found in the cache but is found at the head of a stream buffer, it is moved to the cache and the next entry in the buffer becomes the head.  If the block is found neither in the cache nor at the head of a buffer, the data is brought from memory into the cache and the subsequent address is assigned to a new stream buffer.  Only the heads of the stream buffers are checked during a cache access, not the whole buffers, because checking every entry in every buffer would increase hardware complexity.&lt;br /&gt;
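&lt;br /&gt;
The lookup procedure described above can be sketched in a few lines. This is a simplified illustration, not the design from the cited paper: the class name StreamBufferCache is hypothetical, the cache is modeled as an unbounded set of block numbers, the oldest stream buffer is assumed to be the one replaced, and prefetch latency and the available bit are ignored.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque

# Simplified stream buffer model (assumptions: unbounded cache,
# oldest buffer replaced first, prefetched data arrives instantly).

class StreamBufferCache:
    def __init__(self, num_buffers, depth):
        self.cache = set()                        # block numbers held in the cache
        self.buffers = deque(maxlen=num_buffers)  # each buffer is a FIFO of block numbers
        self.depth = depth                        # entries per buffer, at least 1

    def access(self, block):
        if block in self.cache:
            return "cache hit"
        for buf in self.buffers:
            if buf[0] == block:                   # only the head entry is compared
                next_block = buf[-1] + 1
                buf.popleft()                     # head moves into the cache
                buf.append(next_block)            # keep the stream running ahead
                self.cache.add(block)
                return "stream buffer hit"
        # Miss everywhere: fetch the block and start a new stream after it.
        self.cache.add(block)
        self.buffers.append(deque(range(block + 1, block + 1 + self.depth)))
        return "miss"


sb = StreamBufferCache(num_buffers=2, depth=4)
results = [sb.access(b) for b in [10, 11, 12, 50, 13, 11]]
print(results)
```
&lt;br /&gt;
In this trace the sequential accesses after block 10 are served from the head of the first stream buffer, while the unrelated access to block 50 misses and allocates a second buffer without disturbing the first stream.&lt;br /&gt;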
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Cache_hit_improvements.jpg|center]]&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
The plot above shows the cache hit improvements with respect to the number of stream buffers for different programs.  The graph on the left compares 8 programs from the NAS suite, while the graph on the right shows programs from the Unix unity suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching is definitely helpful to improve performance.  On a multiprocessor system prefetching is useful, but there are tighter constraints on its implementation because data is shared between different processors or cores.  In the message-passing parallel model, each parallel thread has its own memory space and prefetching can be implemented in the same way as on a uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared-memory parallel programming, multiple threads that run on different processors share a common memory space.  If multiple processors use a common cache, then the prefetching implementation is similar to that of a uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but has not used it yet.  Meanwhile, processor P2 reads D1 into its cache and modifies it, and a cache coherency protocol is used to inform P1 of the change.  D1 is not in P1’s cache, so P1 may simply ignore the notification.  When P1 then tries to read D1, it gets the stale data from its stream buffer.  One way to prevent this is to improve stream buffers so that they can modify their data just like a cache.  This adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
# The prefetching mechanism can fetch data D1, D2, D3, …, D10 into P1’s buffer.  Because of parallel processing, P1 only needs to operate on D1 to D5 while P2 operates on the remaining data.  Bandwidth was wasted fetching D6 to D10 into P1’s buffer, since P1 never used them.  There is a trade-off here: if prefetching is too conservative it leads to misses, and if it is too aggressive it wastes bandwidth.&lt;br /&gt;
# In a multiprocessor system, if threads are not bound to a core, the operating system can rebalance threads across cores.  This requires the prefetched buffers to be discarded.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
==Prefetching in Intel Core i7&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track recent accesses to memory:  one for upstreams that has 12 entries, and one for downstreams that has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
&lt;br /&gt;
==Prefetching in AMD&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
&lt;br /&gt;
AMD contends that this is beneficial when processing large data sets which often process sequential data and process all the data in the stream.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing. &lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to ensure that the prefetcher is fetching at a rate sufficient to continuously provide data to the processor.&lt;br /&gt;
&lt;br /&gt;
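The unit-stride behaviour described above can be illustrated with a small Python sketch.  It is only a toy model under stated assumptions: the class name, the depth parameter and the block-number interface are hypothetical and do not describe the actual AMD or Intel hardware.&lt;br /&gt;

```python
class UnitStridePrefetcher:
    """Toy model: when a demand access follows the previous one sequentially,
    keep `depth` blocks requested ahead of the demand stream."""

    def __init__(self, depth=2):
        self.depth = depth        # hypothetical preset number of blocks to run ahead
        self.last_block = None    # block number of the previous demand access
        self.requested = set()    # blocks already requested from memory

    def on_demand_access(self, block):
        """Return the list of block numbers to prefetch for this access."""
        issued = []
        sequential = self.last_block is not None and block == self.last_block + 1
        if sequential:
            for ahead in range(1, self.depth + 1):
                target = block + ahead
                if target not in self.requested:   # skip blocks already on their way
                    self.requested.add(target)
                    issued.append(target)
        self.last_block = block
        return issued
```

With a depth of two, the first sequential access requests two blocks and each later sequential access requests only one more, which keeps the prefetcher two blocks ahead as the text describes.&lt;br /&gt;
&lt;br /&gt;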
=Cache Coherence Support=&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt;Cache coherence (also cache coherency) refers to the consistency of data stored in local caches of a shared resource. In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.&lt;br /&gt;
&lt;br /&gt;
Cache coherence is achieved if the following conditions are met.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and then reads from the same memory location X, then P1 should be returned value A provided no other write happened in-between.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and another processor P2 reads from the same memory location X, then P2 should be returned the value A written by P1 provided no other write happened in-between.  &lt;br /&gt;
* Consider that a processor P1 writes a value A to a memory location X followed by another processor P2 which writes a value B to the same memory location X. In this case the writes should appear in the same order i.e. A and then B.&lt;br /&gt;
&lt;br /&gt;
==Software vs. Hardware solutions==&lt;br /&gt;
&lt;br /&gt;
Both software and hardware solutions exist for cache coherence. Hardware solutions are more widely used than software solutions. Software schemes require support from the cache or runtime system, and in some cases also require hardware assistance.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;ref&amp;gt;[ftp://ftp.cs.wisc.edu/markhill/Papers/isca91_coherence.pdf Comparison of Hardware and Software Cache Coherence Schemes] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon&amp;lt;/ref&amp;gt;Recent studies have shown that software schemes are comparable to the hardware solutions. The only cases in which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts, or when there are many writes in the program that are not executed at run-time. For relatively well structured and deterministic programs, on the other hand, software schemes perform in the same range as the hardware schemes.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Schemes – Fetch and Replacements==&lt;br /&gt;
&lt;br /&gt;
===Invalidation Schemes vs. Update Strategies&amp;lt;ref&amp;gt;[https://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf Cache Coherence] Josep Torrellas&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
* Invalidation: On a write, all other caches with a copy are invalidated. Invalidation is least suitable when there is a single producer and many consumers of data.&lt;br /&gt;
* Update: On a write, all other caches with a copy are updated. The update strategy is least suitable when one processing element, say PE1, performs multiple writes before the data is read by another processing element PE2.&lt;br /&gt;
&lt;br /&gt;
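The trade-off between the two strategies can be made concrete by counting messages for a single shared block.  The following Python sketch is an illustration only: the function name and the cost model (one message per fetch and one message per remote copy invalidated or updated) are assumptions, not part of any cited scheme.&lt;br /&gt;

```python
def count_coherence_messages(trace, policy):
    """Count fetches plus invalidate/update messages for one shared block.

    trace  : list of (op, pe) pairs, where op is 'R' or 'W'
    policy : 'invalidate' or 'update'
    Assumed cost model: 1 message per miss, 1 message per remote copy touched.
    """
    holders = set()      # processing elements holding a valid copy
    messages = 0
    for op, pe in trace:
        if pe not in holders:
            messages += 1                 # miss: the block must be fetched
            holders.add(pe)
        if op == 'W':
            others = holders - {pe}
            messages += len(others)       # notify every other cached copy
            if policy == 'invalidate':
                holders = {pe}            # remote copies are dropped
    return messages


# One producer, three consumers, repeated three times
producer_consumer = [('W', 0), ('R', 1), ('R', 2), ('R', 3)] * 3
# PE1 writes four times before PE2 reads again
write_run = [('R', 2), ('W', 1), ('W', 1), ('W', 1), ('W', 1), ('R', 2)]
```

Under this cost model the producer/consumer trace needs 16 messages with invalidation and 10 with update, while the write-run trace needs 4 messages with invalidation and 6 with update.&lt;br /&gt;
&lt;br /&gt;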
The strategies followed for fetch and replacement in some of the schemes are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Snoopy Cache Coherence Schemes===&lt;br /&gt;
#Snooping&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt; is the process where the individual caches monitor address lines for accesses to memory locations that they have cached.  When a write is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.&lt;br /&gt;
#Snoopy Scheme is a distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.&lt;br /&gt;
#When replacement of one of the entries is required the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries.&lt;br /&gt;
&lt;br /&gt;
===Directory Based Cache Coherence Schemes===&lt;br /&gt;
#In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.&lt;br /&gt;
#Fetch and Replacement Scenarios&amp;lt;ref&amp;gt;[http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf Cache Coherence Techniques] Silvia Lametti&amp;lt;/ref&amp;gt;:&lt;br /&gt;
##Block in Uncached State: The block required by a processor P1 is not in the cache. When a processor P1 tries to read an uncached block X, read miss occurs. &lt;br /&gt;
##*Read Miss: P1 is sent the data from memory and becomes the only sharing node. The cached block X is marked as shared.&lt;br /&gt;
##*Write Miss: P1 is sent the data and becomes the only sharing node. The block is now marked exclusive to indicate that the only valid copy is the cached one.&lt;br /&gt;
##Block is Shared: The block requested by a Processor P1 is in shared state.&lt;br /&gt;
##*Read Miss:  Requesting Processor P1 is sent the data and P1 is added to the set of processors sharing the data.&lt;br /&gt;
##*Write Miss:  Requesting Processor P1 is sent the value. All processors sharing the data are sent an invalidate message. The block is made exclusive for the Processor P1.&lt;br /&gt;
##Block is Exclusive: The current value of the block (say X) is held in the cache of the processor (the owner) P1 which currently owns the block exclusively.&lt;br /&gt;
##*Read Miss: When a processor P2 raises a fetch request for the block X, P1 is notified that a fetch request has been posted for a block it currently holds. P1 sends the data back to the directory, where it is written to memory and sent to the requesting processor P2. P2 is then added to the set of processors sharing the block X along with P1.&lt;br /&gt;
##*Data Write-back: The owner processor P1 is replacing the block and hence must write it back. This causes the copy of block X in memory to be up to date. The block is now uncached and the set of sharing processors is emptied.&lt;br /&gt;
##*Write Miss: A processor P2 makes write request to the block X which is currently owned by P1. P1 will be notified of this, which causes P1 to update the memory with its value of block X. The updated block X is now sent to P2 and it is made exclusive for P2.&lt;br /&gt;
&lt;br /&gt;
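The fetch and replacement scenarios listed above amount to a small state machine per memory block.  The Python sketch below is one possible rendering under stated assumptions: the class and method names are hypothetical, and data movement and message timing are not modelled.&lt;br /&gt;

```python
class DirectoryEntry:
    """Directory state for one block: 'uncached', 'shared' or 'exclusive'."""

    def __init__(self):
        self.state = 'uncached'
        self.sharers = set()     # processors with a valid copy (the owner if exclusive)

    def read_miss(self, p):
        """Handle a read miss from p; return the owner asked to write back, or None."""
        owner = None
        if self.state == 'exclusive':
            owner = next(iter(self.sharers))   # owner sends data back, memory is updated
        self.sharers.add(p)
        self.state = 'shared'
        return owner

    def write_miss(self, p):
        """Handle a write miss from p; return the processors that lose their copy."""
        losers = self.sharers - {p}   # invalidated sharers, or the previous owner
        self.sharers = {p}
        self.state = 'exclusive'
        return losers

    def write_back(self, p):
        """The owner p replaces the block and writes it back to memory."""
        if self.state == 'exclusive' and p in self.sharers:
            self.sharers = set()
            self.state = 'uncached'
```

For example, a read miss by P1 followed by a write miss by P2 leaves the block exclusive to P2 and reports P1 as the copy to invalidate.&lt;br /&gt;
&lt;br /&gt;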
===Write Through Schemes===&lt;br /&gt;
In write-through schemes, every processor write updates the local cache and issues a global bus write that updates main memory and also invalidates or updates all other caches holding the same item.&lt;br /&gt;
&lt;br /&gt;
===Write-Back/Ownership Schemes===&lt;br /&gt;
In write-back/ownership schemes, a single cache has ownership of a block. When the owning processor writes to the block, no bus write is generated, which conserves bandwidth.&lt;br /&gt;
&lt;br /&gt;
===Pointer-Based Coherence Schemes&amp;lt;ref&amp;gt;[http://courses.engr.illinois.edu/cs533/reading_list/1c.pdf Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry&amp;lt;/ref&amp;gt;===&lt;br /&gt;
#The Full Bit Vector Schemes&lt;br /&gt;
#*This scheme associates a complete bit vector, one bit per processor, with each block of main memory. The directory also contains a dirty-bit for each memory block to indicate if some processor has been given exclusive access to modify that block in its cache. Each bit indicates whether that memory block is being cached by the corresponding processor, and thus the directory has full knowledge of the processors caching a given block.&lt;br /&gt;
#*When a block has to be invalidated, messages are sent to all processors whose caches have a copy. In terms of message traffic needed to keep the caches coherent, this is the best that an invalidation-based directory scheme can do.&lt;br /&gt;
#Limited Pointer Schemes&lt;br /&gt;
#*Recent studies have shown that for most kinds of data objects the corresponding memory locations are cached by only a small number of processors at any given time. This knowledge can be exploited to reduce directory memory overhead by restricting each directory entry to a small fixed number of pointers, each pointing to a processor caching that memory block.&lt;br /&gt;
#*An important implication of limited pointer schemes is that there must exist some mechanism to handle blocks that are cached by more processors than the number of pointers in the directory entry.&lt;br /&gt;
&lt;br /&gt;
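The directory memory saved by limited pointers can be estimated with simple arithmetic.  The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, each pointer is assumed to be a binary processor ID, and any extra overflow or broadcast bits that a real limited pointer scheme may need are ignored.&lt;br /&gt;

```python
def full_bit_vector_bits(num_processors):
    """Directory bits per memory block: one presence bit per processor plus a dirty bit."""
    return num_processors + 1


def limited_pointer_bits(num_processors, num_pointers):
    """Directory bits per memory block with a fixed number of processor pointers.

    Assumes each pointer is a binary processor ID, plus one dirty bit.
    """
    pointer_width = max(1, (num_processors - 1).bit_length())   # ceil(log2(P))
    return num_pointers * pointer_width + 1
```

For 64 processors the full bit vector needs 65 bits per block while four pointers need 25 bits; for 1024 processors the figures are 1025 bits and 41 bits.&lt;br /&gt;
&lt;br /&gt;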
==Cache Coherency protocol==&lt;br /&gt;
A '''coherency protocol''' is a protocol which maintains the consistency between all the caches in a system of [[distributed shared memory]]. The protocol maintains [[memory coherence]] according to a specific [[consistency model]]. Older multiprocessors support the [[sequential consistency]] model, while modern shared memory systems typically support the [[release consistency]] or [[weak consistency]] models.&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
&lt;br /&gt;
Various models and protocols have been devised for maintaining coherence, such as [[MSI_protocol|MSI]], [[MESI_protocol|MESI]] (aka Illinois), [[MOSI_protocol|MOSI]], [[MOESI_protocol|MOESI]], [[MERSI_protocol|MERSI]], [[MESIF_protocol|MESIF]], [[Write-once (cache coherence)|write-once]], [[Synapse protocol|Synapse]], [[Berkeley protocol|Berkeley]], [[Firefly protocol|Firefly]] and [[Dragon protocol|Dragon]].&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology- Lectures slides by Prof.Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel computer architecture: a hardware/software approach” by David. E. Culler, Jaswinder Pal Singh, Anoop Gupta  (pg 887)    &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler.   (ref#2) &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Other References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73598</id>
		<title>CSC/ECE 506 Spring 2013/6a cs</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73598"/>
		<updated>2013-02-24T07:51:00Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Write-Back/Ownership Schemes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|300px|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, the processor reads data and instructions from memory and operates on the data.  CPU operating frequencies have increased faster than the speed of memory and memory interconnects.  For example, cores in Intel's first-generation i7 processors run at 3.2 GHz, while the memory runs at only 1.3 GHz.  Multi-core architectures also put more demand on memory bandwidth.  Together these increase memory access latency and can leave the CPU idle for most of the time, so memory became a performance bottleneck.&lt;br /&gt;
&lt;br /&gt;
To solve this problem, the “cache” was invented.  A cache is temporary volatile storage, like primary memory, but it runs at a speed close to the core frequency.  The CPU can access data and instructions in the cache in a few clock cycles, while accessing main memory can take more than 50 cycles.  In the early days of computing, the cache was implemented as a stand-alone chip outside the processor.  In today’s processors, the cache is implemented on the same die as the core.&lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of cache, each level farther from the core and larger in size.  L1 is closest to the CPU and, as a result, the fastest to access.  Next comes the L2 cache and then L3.  The L1 cache is divided into an instruction cache and a data cache.  This is better than a single larger combined cache because the instruction cache is read-only and therefore simpler to implement, while the data cache must support both reads and writes.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds that are advancing more rapidly.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16-64KB and provides access in 2-4 cycles.  L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles.  L3 typically ranges from 4MB to 32MB and provides access to the data it contains within 30-50 cycles.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management by requiring it to consider all processes running concurrently in cache coherency, replenishment, fetching, and other management functions.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors and others private to each processor.  Typically, lower-level caches are private and high-level caches may or may not be shared.   Notice that none of the examples below has a shared L1 cache.  Cache management functions must consider both shared and private caches when reading and writing data from memory.&lt;br /&gt;
[[Image:Uniprocessor_memory_hierarchy.jpg|thumbnail|200px|Uniprocessor memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[Image:Dualcore_memory_hierarchy.jpg|thumbnail|200px|Dualcore memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;|Table 1: Cache on different Microprocessors&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache Per core !! L2 cache Per core !! L3 cache  !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 1MB 16way set assoc. || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64X2 || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 512KB/1MB 16way set assoc. || 2MB || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Barcelona || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Aug 2007&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems Ultra Sparc T2 || 8 ||  I:16KB   D:8KB || 4MB (8 banks) 16way set assoc. || - || Oct 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Wolfdale DP || 2 ||  D:96KB  || 6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Harpertown || 4 ||  D:96KB  || 2*6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Phenom || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Nov 2007    Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo || 2 ||  I:32KB   D:32KB  || 2/4MB 8 way set assoc. || - ||  2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Penryn Wolfdale DP || 4 ||  -  || 6-12MB || 6MB || Mar 2008     Aug 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Quad Yorkfield || 4 ||  D:96KB  || 12MB  || - ||  Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| AMD Toliman || 3K10 || I:64KB   D:64KB  || 512KB || 2MB Shared || Apr 2008&lt;br /&gt;
|-&lt;br /&gt;
| Azul Systems Vega3 7300 Series || 864 || 768GB  || - || - || May 2008&lt;br /&gt;
|-&lt;br /&gt;
| IBM RoadRunner || 8+1 || 32KB  || 512KB || - || Jun 2008&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
In section 6.2.3&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory. As a review, write-through writes data to both the cache and memory on a write. Write-back writes to the cache first and to memory only when a flush is required.&lt;br /&gt;
The write miss policies covered in the text&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
[[Image:Cache_policy_comparison.jpg|thumbnail|600px|Cache policy comparison&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Easy to implement&lt;br /&gt;
***Main memory has most recent copy of the data&lt;br /&gt;
***Read misses never result in writes to main memory&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Every write needs to access main memory&lt;br /&gt;
***Bandwidth intensive&lt;br /&gt;
***Writes are slower&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Writes are as fast as the speed of the cache memory&lt;br /&gt;
***Multiple writes to a block require one write to main memory&lt;br /&gt;
***Less bandwidth intensive&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Harder to implement&lt;br /&gt;
***Main memory may not be consistent with cache&lt;br /&gt;
***Reads that result in data replacement may cause dirty blocks to be written to main memory&lt;br /&gt;
&lt;br /&gt;
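The bandwidth difference between the two write hit policies can be seen by counting writes to main memory for a short access trace.  This Python sketch is an illustration only: the function name, the tiny fully associative cache, LRU replacement and write-allocate are all assumptions made to keep the example short.&lt;br /&gt;

```python
def memory_writes(trace, policy, capacity=2):
    """Count writes to main memory for a list of (op, block) accesses.

    op is 'R' or 'W'; policy is 'write-through' or 'write-back'.
    Assumes a fully associative cache with LRU replacement and write-allocate.
    Dirty blocks still cached at the end of the trace are not flushed.
    """
    cache = {}       # block -> dirty flag; insertion order tracks recency
    writes = 0
    for op, block in trace:
        dirty = cache.pop(block, None)     # re-inserting below marks it most recent
        if dirty is None:                  # miss
            dirty = False
            if len(cache) == capacity:
                victim = next(iter(cache))     # least recently used block
                if cache.pop(victim):          # a dirty victim is written back
                    writes += 1
        if op == 'W':
            if policy == 'write-through':
                writes += 1                # every write goes to memory
            else:
                dirty = True               # memory is updated only on eviction
        cache[block] = dirty
    return writes


# Four writes to block A, then two reads that force A out of a two-entry cache
trace = [('W', 'A')] * 4 + [('R', 'B'), ('R', 'C')]
```

For this trace write-through performs four memory writes, while write-back performs a single write when the read of C evicts the dirty block A.&lt;br /&gt;
&lt;br /&gt;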
==Write miss policies==&lt;br /&gt;
Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus. &lt;br /&gt;
&lt;br /&gt;
*Write-allocate vs Write no-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block.  For write-allocate, there will be a line in the cache for the written data.  This policy is typically associated with write-back caches.  For no-write-allocate, there will not be a line in the cache.&lt;br /&gt;
*Fetch-on-write vs no-fetch-on-write: The fetch-on-write will cause the block of data to be fetched from a lower memory hierarchy if the write misses.  The policy fetches a block on every write miss. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line. Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.&lt;br /&gt;
*Write-before-hit vs no-write-before-hit:  The write-before-hit will write data to the cache before checking the cache tags for a match.  In case of a miss, the policy will displace the block of data already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram. They are considered unproductive since the data is fetched from lower level memory, but is not stored in the cache. Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low.&lt;br /&gt;
&lt;br /&gt;
* Write-validate: It is a combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run.  Write-validate requires that the lower level system memory can support writes of partial lines. The no-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the process is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-around:  Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme to write to cache.  It writes data to the next lower cache without modifying the data of the cache line. It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing cache line unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that in only a few cases does write-around perform better than write-validate. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from lower-level memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst. The author notes, though, that it does perform better than fetch-on-write and is easy to implement.&lt;br /&gt;
&lt;br /&gt;
* Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the base-line for determining the performance of the other three policy combinations.   In general, all three previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
&lt;br /&gt;
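The naming of the combinations described in this section can be summarised as a small decision function.  The function and parameter names in this Python sketch are hypothetical; it simply restates the definitions given above.&lt;br /&gt;

```python
def classify_write_miss_policy(fetch_on_write, write_allocate, write_before_hit):
    """Map three boolean policy choices to the combination names used in this section."""
    if fetch_on_write:
        # Fetching a block without allocating a line for it is the blocked-out case
        return 'fetch-on-write' if write_allocate else 'unproductive'
    if write_allocate:
        return 'write-validate'
    return 'write-invalidate' if write_before_hit else 'write-around'
```

For example, no-fetch-on-write with no-write-allocate and write-before-hit is classified as write-invalidate.&lt;br /&gt;
&lt;br /&gt;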
=Prefetching=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. A cache is very efficient in terms of access time once the data or instructions are in the cache.  But when a process tries to access something that is not already in the cache, a cache miss occurs and that data must be brought into the cache from memory.  Cache misses are generally expensive because the processor has to wait for the data (in parallel processing, a processor can execute other tasks while it waits, but this adds some overhead).  Prefetching brings data into the cache before the program needs it; in other words, it is a way to reduce cache misses.  Prefetching uses some type of prediction mechanism to anticipate the next data that will be needed and brings it into the cache.  It is not guaranteed that the prefetched data will be used.  The goal is to reduce cache misses and thereby improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into the cache.  Programmers and compilers can insert these prefetch instructions in the code.  This is known as software prefetching.  In hardware prefetching, the processor observes the system behavior and issues prefetch requests itself.  The Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch.   Graphics Processing Units benefit from prefetching due to the spatial locality of their data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires a more complex architecture.  A second-order effect is the added cost of silicon implementation and validation.&lt;br /&gt;
* Software prefetching adds additional instructions to the program, making the program larger.&lt;br /&gt;
* If the same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted.  If the evicted blocks are needed later, this generates a cache miss.  This can be prevented by having a separate cache for prefetching, but that comes with hardware costs.&lt;br /&gt;
* When the scheduler changes the task running on a processor, prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked with the following metrics:&lt;br /&gt;
# Coverage is the fraction of the original cache misses that become cache hits because the data was prefetched.&lt;br /&gt;
# Accuracy is the fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetches arrive.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy and optimum timeliness.  Realistically, aggressive prefetching can increase coverage but decrease accuracy, and vice versa.  Also, if prefetching is done too early, the fetched data may be evicted without being used, and if it is done too late, the access still results in a cache miss.&lt;br /&gt;
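The coverage and accuracy definitions above can be written as two ratios.  The following is a minimal sketch; the function name and the counts are hypothetical and are not measurements from any real processor.&lt;br /&gt;

```python
def prefetch_metrics(baseline_misses, useful_prefetches, issued_prefetches):
    """Return (coverage, accuracy) as fractions between 0 and 1."""
    # Coverage: share of the original misses that prefetching turned into hits.
    coverage = useful_prefetches / baseline_misses if baseline_misses else 0.0
    # Accuracy: share of the issued prefetches that were actually used.
    accuracy = useful_prefetches / issued_prefetches if issued_prefetches else 0.0
    return coverage, accuracy


# Hypothetical counts: 1000 misses without prefetching, 800 prefetches issued,
# 600 of which were used before being evicted.
coverage, accuracy = prefetch_metrics(1000, 600, 800)
print(coverage, accuracy)
```

Issuing more prefetches can only raise the coverage ratio if more of them are useful, while every unused prefetch lowers the accuracy ratio, which is the trade-off described above.&lt;br /&gt;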
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This prefetching technique uses a FIFO buffer in which each entry is a cache line and has an address (or tag) and an available bit.  The system prefetches a stream of sequential data into a stream buffer, and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, the head entries of the stream buffers are checked for a match along with the cache.  If the block is not found in the cache but is found at the head of a stream buffer, it is moved to the cache and the next entry in the buffer becomes the head.  If the block is found neither in the cache nor at the head of a buffer, the data is brought from memory into the cache and the subsequent addresses are assigned to a new stream buffer.   Only the heads of the stream buffers are checked during a cache access, not the whole buffers, because checking all the entries in all the buffers would increase hardware complexity.&lt;br /&gt;
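The lookup sequence described above can be sketched in a few lines of Python.  This is an illustrative model only: the class name, the unbounded cache, and the choice to recycle the oldest stream buffer when all are in use are assumptions made for the sketch, not details taken from the cited work.&lt;br /&gt;

```python
from collections import deque


class StreamBufferCache:
    """Toy model: a cache plus FIFO stream buffers of sequential block addresses."""

    def __init__(self, num_buffers=2, depth=4):
        self.cache = set()  # block addresses held in the cache (capacity ignored)
        self.depth = depth
        # Assumption: when every buffer is in use, the oldest one is recycled.
        self.buffers = deque(maxlen=num_buffers)

    def access(self, block):
        if block in self.cache:
            return "cache hit"
        # Only the head entry of each stream buffer is compared.
        for buf in self.buffers:
            if buf and buf[0] == block:
                buf.popleft()                     # next entry becomes the head
                last = buf[-1] if buf else block
                buf.append(last + 1)              # keep the stream running ahead
                self.cache.add(block)             # move the block into the cache
                return "stream buffer hit"
        # Miss everywhere: fetch the block and start a new stream after it.
        self.cache.add(block)
        self.buffers.append(deque(range(block + 1, block + 1 + self.depth)))
        return "miss"


sb = StreamBufferCache()
print([sb.access(b) for b in (0, 1, 2, 3)])
```

In this model an access to block 2 right after a miss on block 0 is still a miss, even though block 2 sits in the buffer, because only the head (block 1) is checked.&lt;br /&gt;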
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Cache_hit_improvements.jpg|center]]&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
The plot above shows the cache hit improvements with respect to the number of stream buffers for different programs.  The graph on the left compares 8 programs from the NAS suite, while the graph on the right shows programs from the Unix unity suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching is definitely helpful for improving performance.  On a multiprocessor system prefetching is also useful, but there are tighter constraints on its implementation because data is shared between different processors or cores.  In the message-passing parallel model, each parallel thread has its own memory space and prefetching can be implemented in the same way as for a uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared-memory parallel programming, multiple threads that run on different processors share a common memory space.  If multiple processors use a common cache, then the prefetching implementation is similar to that of a uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but has not used it yet.  Meanwhile, processor P2 reads D1 into its cache and modifies it, and the cache coherence protocol informs P1 of the change.  D1 is not in P1’s cache, so P1 may simply ignore the notification.  When P1 then tries to read D1, it gets the stale data from its stream buffer.  One way to prevent this is to improve stream buffers so that they can update their data just like a cache.  This adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
# The prefetching mechanism can fetch data D1, D2, D3, …, D10 into P1’s buffer.  Because of parallel processing, P1 only needs to operate on D1 to D5 while P2 operates on the remaining data.  The bandwidth spent fetching D6 to D10 into P1 was wasted.  There is a trade-off to be made here: if prefetching is very conservative it leads to misses, and if it is aggressive it wastes bandwidth.&lt;br /&gt;
# In a multiprocessor system, if threads are not bound to a core, the operating system can rebalance the threads across cores.  This requires the prefetched buffers to be discarded.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
==Prefetching in Intel Core i7&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory:  one for the upstreams that has 12 entries, and one for the downstreams that has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
&lt;br /&gt;
==Prefetching in AMD&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
&lt;br /&gt;
AMD contends that this is beneficial when processing large data sets which often process sequential data and process all the data in the stream.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing. &lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that is triggered when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to ensure that the prefetcher fetches at a rate sufficient to continuously provide data to the processor.&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Support=&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt;Cache coherence (also cache coherency) refers to the consistency of data stored in local caches of a shared resource. In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.&lt;br /&gt;
&lt;br /&gt;
Cache coherence is achieved if the following conditions are met.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and  reads from the same memory location X, then P1 should be returned value A provided no other write happened in-between.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and another processor P2 reads from the same memory location X, then P2 should be returned the value A written by P1 provided no other write happened in-between.  &lt;br /&gt;
* Consider that a processor P1 writes a value A to a memory location X followed by another processor P2 which writes a value B to the same memory location X. In this case the writes should appear in the same order i.e. A and then B.&lt;br /&gt;
&lt;br /&gt;
==Software vs. Hardware solutions==&lt;br /&gt;
&lt;br /&gt;
Both software and hardware solutions exist for cache coherence. Hardware solutions are more widely used than software solutions. Software schemes require support from the compiler or runtime system, and in some cases also require hardware assistance. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;ref&amp;gt;[ftp://ftp.cs.wisc.edu/markhill/Papers/isca91_coherence.pdf Comparison of Hardware and Software Cache Coherence Schemes] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon&amp;lt;/ref&amp;gt;Recent studies have shown that software schemes are comparable to the hardware solutions. The only cases in which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts, or when there are many writes in the program that are not executed at run-time. For relatively well-structured and deterministic programs, on the other hand, software schemes perform in the same range as the hardware schemes.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Schemes – Fetch and Replacements==&lt;br /&gt;
&lt;br /&gt;
===Invalidation Schemes vs. Update Strategies&amp;lt;ref&amp;gt;[https://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf Cache Coherence] Josep Torrellas&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
* Invalidation: On a write, all other caches with a copy are invalidated. Invalidation is least suitable when there is a single producer and many consumers of the data.&lt;br /&gt;
* Update: On a write, all other caches with a copy are updated. The update strategy is least suitable when one processing element, say PE1, performs multiple writes before the data is read by another processing element PE2.&lt;br /&gt;
&lt;br /&gt;
The strategies followed for fetch and replacement in some of the schemes are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Snoopy Cache Coherence Schemes===&lt;br /&gt;
#Snooping&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt; is the process where the individual caches monitor address lines for accesses to memory locations that they have cached.  When a write is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.&lt;br /&gt;
#Snoopy Scheme is a distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.&lt;br /&gt;
#When replacement of one of the entries is required the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries.&lt;br /&gt;
&lt;br /&gt;
===Directory Based Cache Coherence Schemes===&lt;br /&gt;
#In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.&lt;br /&gt;
#Fetch and Replacement Scenarios&amp;lt;ref&amp;gt;[http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf Cache Coherence Techniques] Silvia Lametti&amp;lt;/ref&amp;gt;:&lt;br /&gt;
##Block in Uncached State: The block X required by a processor P1 is not in any cache, so an access by P1 results in a miss. &lt;br /&gt;
##*Read Miss: P1 is sent the data from memory and is made the only sharing node. The block X that is cached is marked as shared.&lt;br /&gt;
##*Write Miss: P1 is sent the data and also made as the sharing node. The block is now marked exclusive to indicate that the only valid copy is cached.&lt;br /&gt;
##Block is Shared: The block requested by a Processor P1 is in shared state.&lt;br /&gt;
##*Read Miss:  Requesting Processor P1 is sent the data and P1 is added to the set of processors sharing the data.&lt;br /&gt;
##*Write Miss:  Requesting Processor P1 is sent the value. All processors sharing the data are sent an invalidate message. The block is made exclusive for the Processor P1.&lt;br /&gt;
##Block is Exclusive: The current value of the block (say X) is held in the cache of the processor (the owner) P1 which currently owns the block exclusively.&lt;br /&gt;
##*Read Miss: When a processor P2 raises a fetch request for the block X, P1 is notified that a fetch request has been posted for a block it currently holds. P1 sends the data back to the directory, where it is written to memory and sent on to the requesting processor P2. P2 is now added to the set of processors sharing the block X along with P1.&lt;br /&gt;
##*Data Write-back: The owner processor P1 is replacing the block and hence must write it back. This causes the copy of block X in memory to be up to date. The block is now uncached and the set of sharing processors is emptied.&lt;br /&gt;
##*Write Miss: A processor P2 makes write request to the block X which is currently owned by P1. P1 will be notified of this, which causes P1 to update the memory with its value of block X. The updated block X is now sent to P2 and it is made exclusive for P2.&lt;br /&gt;
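The fetch and replacement scenarios above amount to a small state machine kept per memory block.  The sketch below models only the directory bookkeeping (state and sharer set), not the data movement; the class and method names are hypothetical.&lt;br /&gt;

```python
class DirectoryEntry:
    """Directory state for one memory block: 'uncached', 'shared' or 'exclusive'."""

    def __init__(self):
        self.state = "uncached"
        self.sharers = set()

    def read_miss(self, proc):
        # If another cache owns the block exclusively, it first sends the data
        # back so memory is up to date; that owner then remains a sharer.
        self.sharers.add(proc)
        self.state = "shared"

    def write_miss(self, proc):
        # Every other holder (the sharers or the previous owner) loses its copy.
        invalidated = self.sharers - {proc}
        self.sharers = {proc}
        self.state = "exclusive"
        return invalidated

    def write_back(self, proc):
        # The owner replaces the block: memory is updated and nobody caches it.
        if self.state == "exclusive" and self.sharers == {proc}:
            self.sharers = set()
            self.state = "uncached"


entry = DirectoryEntry()
entry.read_miss("P1")
entry.read_miss("P2")
print(entry.state, sorted(entry.sharers))
print(sorted(entry.write_miss("P1")), entry.state)
```

The set returned by write_miss lists the processors that would receive an invalidate message in the scenarios above.&lt;br /&gt;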
&lt;br /&gt;
===Write Through Schemes===&lt;br /&gt;
#In write-through schemes, every processor write results in an update of the local cache and a global bus write that updates the main memory and also invalidates or updates all other caches holding the same item.&lt;br /&gt;
&lt;br /&gt;
===Write-Back/Ownership Schemes===&lt;br /&gt;
In write-back/ownership schemes, a single cache has ownership of a block. In this case when a processor writes it will not result in bus writes thus conserving bandwidth. &lt;br /&gt;
&lt;br /&gt;
===Pointer-Based Coherence Schemes&amp;lt;ref&amp;gt;[http://courses.engr.illinois.edu/cs533/reading_list/1c.pdf Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry&amp;lt;/ref&amp;gt;===&lt;br /&gt;
#The Full Bit Vector Schemes&lt;br /&gt;
#*This scheme associates a complete bit vector, one bit per processor, with each block of main memory. The directory also contains a dirty-bit for each memory block to indicate if some processor has been given exclusive access to modify that block in its cache. Each bit indicates whether that memory block is being cached by the corresponding processor, and thus the directory has full knowledge of the processors caching a given block.&lt;br /&gt;
#*When a block has to be invalidated, messages are sent to all processors whose caches have a copy. In terms of message traffic needed to keep the caches coherent, this is the best that an invalidation-based directory scheme can do.&lt;br /&gt;
#Limited Pointer Schemes&lt;br /&gt;
#*Recent studies have shown that for most kinds of data objects the corresponding memory locations are cached by only a small number of processors at any given time. This knowledge can be exploited to reduce directory memory overhead by restricting each directory entry to a small fixed number of pointers, each pointing to a processor caching that memory block.&lt;br /&gt;
#*An important implication of limited pointer schemes is that there must exist some mechanism to handle blocks that are cached by more processors than the number of pointers in the directory entry.&lt;br /&gt;
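The directory memory saving can be estimated with simple arithmetic.  The sketch below assumes one presence bit per processor plus a dirty bit for the full bit vector (as described above), and assumes that each limited pointer needs just enough bits to name one processor; the function names and the example sizes are hypothetical.&lt;br /&gt;

```python
def full_bit_vector_bits(num_procs):
    # One presence bit per processor plus one dirty bit per memory block.
    return num_procs + 1


def limited_pointer_bits(num_procs, num_pointers):
    # Assumption: a pointer is wide enough to identify any one processor,
    # i.e. the base-2 logarithm of the processor count, rounded up.
    pointer_width = (num_procs - 1).bit_length()
    return num_pointers * pointer_width + 1  # plus the dirty bit


# Hypothetical machine: 256 processors, 4 pointers per directory entry.
print(full_bit_vector_bits(256), limited_pointer_bits(256, 4))
```

For the hypothetical 256-processor machine this gives 257 bits per block for the full bit vector against 33 bits with four pointers, at the price of needing an overflow mechanism when more than four processors cache the block.&lt;br /&gt;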
&lt;br /&gt;
==Cache Coherency protocol==&lt;br /&gt;
A '''coherency protocol''' is a protocol which maintains the consistency between all the caches in a system of [[distributed shared memory]]. The protocol maintains [[memory coherence]] according to a specific [[consistency model]]. Older multiprocessors support the [[sequential consistency]] model, while modern shared memory systems typically support the [[release consistency]] or [[weak consistency]] models.&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
&lt;br /&gt;
Various models and protocols have been devised for maintaining coherence, such as [[MSI_protocol|MSI]], [[MESI_protocol|MESI]] (aka Illinois), [[MOSI_protocol|MOSI]], [[MOESI_protocol|MOESI]], [[MERSI_protocol|MERSI]], [[MESIF_protocol|MESIF]], [[Write-once (cache coherence)|write-once]], [[Synapse protocol|Synapse]], [[Berkeley protocol|Berkeley]], [[Firefly protocol|Firefly]] and [[Dragon protocol]].&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology- Lectures slides by Prof.Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel computer architecture: a hardware/software approach” by David. E. Culler, Jaswinder Pal Singh, Anoop Gupta  (pg 887)    &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler.   (ref#2) &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Other References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73597</id>
		<title>CSC/ECE 506 Spring 2013/6a cs</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73597"/>
		<updated>2013-02-24T07:50:24Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Write-Back/Ownership Schemes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|300px|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, the processor reads data and instructions from memory and operates on the data.  The operating frequency of CPUs has increased faster than the speed of memory and memory interconnects.  For example, cores in Intel first-generation i7 processors run at 3.2 GHz, while the memory only runs at 1.3 GHz.  Multi-core architectures have also put more demand on memory bandwidth.  This increases memory access latency, so the CPU has to sit idle most of the time and memory becomes a performance bottleneck.   &lt;br /&gt;
&lt;br /&gt;
To solve this problem, the “cache” was invented.  A cache is simply a temporary volatile storage space like primary memory, but it runs at a speed similar to the core frequency.  The CPU can access data and instructions from the cache in a few clock cycles, while accessing data from main memory can take more than 50 cycles.  In the early days of computing, the cache was implemented as a stand-alone chip outside the processor.  In today’s processors, the cache is implemented on the same die as the core.  &lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of cache, each one farther from the core and larger in size.  L1 is closest to the CPU and, as a result, the fastest to access.  Next to L1 is the L2 cache and then L3.  The L1 cache is divided into an instruction cache and a data cache. This is better than having a combined larger cache, because the instruction cache is read-only and therefore easy to implement, while the data cache is read-write.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds that are advancing at a more rapid pace.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16-64KB and provides access in 2-4 cycles.  L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles.  L3 typically ranges from 4MB to 32MB and provides access to the data it contains within 30-50 cycles.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management by requiring it to consider all processes running concurrently in its cache coherency, replenishment, fetching, and other management functions.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors and others private to each processor.  Typically, lower-level caches are private and higher-level caches may or may not be shared.   Notice that none of the examples below has a shared L1 cache.  Cache management functions must consider both shared and private caches when reading and writing data from memory.&lt;br /&gt;
[[Image:Uniprocessor_memory_hierarchy.jpg|thumbnail|200px|Uniprocessor memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[Image:Dualcore_memory_hierarchy.jpg|thumbnail|200px|Dualcore memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;|Table 1: Cache on different Microprocessors&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache per core !! L2 cache !! L3 cache  !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB     D:64KB  Both 2-way set assoc. || 1MB 16way set assoc. || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64X2 || 2 || I:64KB     D:64KB  Both 2-way set assoc. || 512KB/1MB 16way set assoc. || 2MB || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Barcelona || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Aug 2007&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems Ultra Sparc T2 || 8 ||  I:16KB   D:8KB || 4MB (8 banks) 16way set assoc. || - || Oct 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Wolfdale DP || 2 ||  D:96KB  || 6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Harpertown || 4 ||  D:96KB  || 2*6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Phenom || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Nov 2007    Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo || 2 ||  I:32KB   D:32KB  || 2/4MB 8 way set assoc. || - ||  2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Penryn Wolfdale DP || 4 ||  -  || 6-12MB || 6MB || Mar 2008     Aug 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Quad Yorkfield || 4 ||  D:96KB  || 12MB  || - ||  Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| AMD Toliman || 3K10 || I:64KB   D:64KB  || 512KB || 2MB Shared || Apr 2008&lt;br /&gt;
|-&lt;br /&gt;
| Azul Systems Vega3 7300 Series || 864 || 768GB  || - || - || May 2008&lt;br /&gt;
|-&lt;br /&gt;
| IBM RoadRunner || 8+1 || 32KB  || 512KB || - || Jun 2008&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
In section 6.2.3&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory. As a review, write-through writes data to both the cache and memory on every write. Write-back writes to the cache first and to memory only when a flush is required.&lt;br /&gt;
The write miss policies covered in the text&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
[[Image:Cache_policy_comparison.jpg|thumbnail|600px|Cache policy comparison&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Easy to implement&lt;br /&gt;
***Main memory has most recent copy of the data&lt;br /&gt;
***Read misses never result in writes to main memory&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Every write needs to access main memory&lt;br /&gt;
***Bandwidth intensive&lt;br /&gt;
***Writes are slower&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Writes are as fast as the speed of the cache memory&lt;br /&gt;
***Multiple writes to a block require one write to main memory&lt;br /&gt;
***Less bandwidth intensive&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Harder to implement&lt;br /&gt;
***Main memory may not be consistent with cache&lt;br /&gt;
***Reads that result in data replacement may cause dirty blocks to be written to main memory&lt;br /&gt;
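The bandwidth difference between the two write hit policies can be illustrated by counting main-memory writes.  This is a deliberately simplified sketch: it assumes the cache is large enough that no block is purged until the end of the sequence, and the function name and the example sequence are hypothetical.&lt;br /&gt;

```python
def memory_writes(write_sequence, policy):
    """Count main-memory writes for a sequence of block addresses written by the CPU."""
    if policy == "write-through":
        # Every write to the cache is also sent to main memory.
        return len(write_sequence)
    if policy == "write-back":
        # Each dirty block is written to memory once, when it is finally purged.
        return len(set(write_sequence))
    raise ValueError("unknown policy: " + policy)


# Hypothetical trace: three writes to block 1 followed by one write to block 2.
trace = [1, 1, 1, 2]
print(memory_writes(trace, "write-through"), memory_writes(trace, "write-back"))
```

With repeated writes to the same block, write-back needs one memory write per block where write-through needs one per store.&lt;br /&gt;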
&lt;br /&gt;
==Write miss policies==&lt;br /&gt;
Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus. &lt;br /&gt;
&lt;br /&gt;
*Write-allocate vs Write no-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block.  For write-allocate, there will be a line in the cache for the written data.  This policy is typically associated with write-back caches.  For no-write-allocate, there will not be a line in the cache.&lt;br /&gt;
*Fetch-on-write vs no-fetch-on-write: The fetch-on-write will cause the block of data to be fetched from a lower memory hierarchy if the write misses.  The policy fetches a block on every write miss. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line. Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.&lt;br /&gt;
*Write-before-hit vs no-write-before-hit:  The write-before-hit will write data to the cache before checking the cache tags for a match.  In case of a miss, the policy will displace the block of data already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram. They are considered unproductive since the data is fetched from lower-level memory but is not stored in the cache. Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low.&lt;br /&gt;
&lt;br /&gt;
* Write-validate: It is a combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It provides better performance, works with machines that have various line sizes, and does not add instruction execution overhead to the program being run.  Write-validate requires that the lower-level system memory can support writes of partial lines. No-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the processor is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-around:  Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme to write to cache.  It writes data to the next lower cache without modifying the data of the cache line. It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that write-around performs worse than write-validate in all but a few cases. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from lower-level memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst, with the smallest reduction in write misses (35-50%). The author notes, though, that it does perform better than fetch-on-write and is easy to implement.&lt;br /&gt;
&lt;br /&gt;
* Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the baseline for determining the performance of the other three policy combinations.  In general, all three previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
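The naming of the combinations described above can be summarized in a small lookup. This is only a sketch: the function name and the label 'unproductive' are illustrative assumptions, not terms from the cited paper.&lt;br /&gt;

```python
# Hypothetical helper: maps the three write-miss choices to the policy
# names used in this section. Names and labels are illustrative only.
def write_miss_policy(fetch_on_write, write_allocate, write_before_hit):
    if fetch_on_write:
        # Fetching a block that is then not allocated is the blocked-out case.
        return 'fetch-on-write' if write_allocate else 'unproductive'
    if write_allocate:
        return 'write-validate'
    return 'write-invalidate' if write_before_hit else 'write-around'

# Enumerate all eight combinations of the three choices.
for fetch in (True, False):
    for allocate in (True, False):
        for before_hit in (True, False):
            print(fetch, allocate, before_hit,
                  write_miss_policy(fetch, allocate, before_hit))
```

The enumeration prints every combination, which makes it easy to see that both fetch-on-write with no-write-allocate rows land in the unproductive category.&lt;br /&gt;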
&lt;br /&gt;
=Prefetching=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. A cache is very efficient in terms of access time once the data or instructions are in it. But when a process tries to access something that is not already in the cache, a cache miss occurs and the missing blocks must be brought into the cache from memory. Cache misses are generally expensive because the processor has to wait for the data. (In parallel processing, a process can execute other tasks while it is waiting on data, but there will be some overhead for this.) Prefetching brings data into the cache before the program needs it; in other words, it is a way to reduce cache misses. It uses some type of prediction mechanism to anticipate the next data that will be needed and brings that data into the cache. It is not guaranteed that the prefetched data will be used. The goal is to reduce cache misses and thereby improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into cache. Programmers and compilers can insert these prefetch instructions in the code. This is known as software prefetching. In hardware prefetching, the processor observes the system behavior and issues prefetch requests on its own. Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch. Graphics Processing Units benefit from prefetching due to the spatial locality of their data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.   &lt;br /&gt;
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires a more complex architecture.  A second-order effect is the higher cost of silicon implementation and validation.&lt;br /&gt;
* Software prefetching adds additional instructions to the program, making the program larger.&lt;br /&gt;
* If the same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted.  If the evicted blocks are needed later, accessing them will generate a cache miss.  This can be prevented by having a separate cache for prefetching, but that comes with hardware costs.&lt;br /&gt;
* When the scheduler changes the task running on a processor, prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked by the following metrics:&lt;br /&gt;
# Coverage is defined as the fraction of original cache misses that were prefetched, resulting in a cache hit.&lt;br /&gt;
# Accuracy is defined as the fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetches arrive.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy and optimum timeliness.  Realistically, aggressive prefetching can increase coverage but decrease accuracy, and vice versa.  Also, if prefetching is done too early, the fetched data may have to be evicted without being used, and if it is done too late, the access can still result in a cache miss.&lt;br /&gt;
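As a small worked example, the first two metrics can be computed from three counters. The counter names and sample values below are assumptions for illustration, and each useful prefetch is assumed to remove exactly one original miss.&lt;br /&gt;

```python
def prefetch_metrics(original_misses, prefetches_issued, useful_prefetches):
    # Coverage: share of the original misses removed by prefetching.
    coverage = useful_prefetches / original_misses
    # Accuracy: share of issued prefetches that were actually used.
    accuracy = useful_prefetches / prefetches_issued
    return coverage, accuracy

# Hypothetical counts: 1000 misses without prefetching, 800 prefetches
# issued, 600 of them used before eviction.
coverage, accuracy = prefetch_metrics(1000, 800, 600)
print(coverage, accuracy)
```

With these sample counts the prefetcher covers 60% of the original misses at 75% accuracy. Issuing more prefetches would tend to raise the first number and lower the second.&lt;br /&gt;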
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This prefetching technique uses a FIFO buffer in which each entry is a cache line with an address (or tag) and an available bit.  The system prefetches a stream of sequential data into a stream buffer, and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, the head entries of the stream buffers are checked for a match along with the cache lookup. If the block is not found in the cache but is found at the head of a stream buffer, it is moved to the cache and the next entry in the buffer becomes the head.  If the block is found neither in the cache nor at the head of a buffer, the data is brought from memory into the cache and the subsequent address is assigned to a new stream buffer.  Only the heads of the stream buffers are checked during a cache access, not the whole buffer, because checking all the entries in all the buffers would increase hardware complexity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Cache_hit_improvements.jpg|center]]&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
The plot above shows the cache hit improvement as a function of the number of stream buffers for different programs.  The graph on the left compares 8 programs from the NAS suite, while the graph on the right shows programs from the Unix unity suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
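The stream buffer lookup described in this section can be sketched as a toy simulator. The class name, the unbounded cache, the oldest-first buffer replacement, and the refill of one block per buffer hit are all simplifying assumptions, not details taken from the cited paper.&lt;br /&gt;

```python
from collections import deque

class StreamBufferCache:
    """Toy model: an unbounded cache plus a few FIFO stream buffers."""

    def __init__(self, num_buffers=2, depth=4):
        self.cache = set()   # resident block addresses (capacity ignored)
        self.depth = depth   # entries per stream buffer
        # When full, appending a new buffer drops the oldest one.
        self.buffers = deque(maxlen=num_buffers)
        self.stats = {'cache_hit': 0, 'buffer_hit': 0, 'miss': 0}

    def access(self, block):
        if block in self.cache:
            self.stats['cache_hit'] += 1
            return 'cache_hit'
        for buf in self.buffers:
            # Only the head entry of each buffer is compared.
            if buf[0] == block:
                next_block = buf[-1] + 1
                self.cache.add(buf.popleft())   # move head into the cache
                buf.append(next_block)          # keep the stream running ahead
                self.stats['buffer_hit'] += 1
                return 'buffer_hit'
        # Miss everywhere: fetch the block and start a new stream after it.
        self.cache.add(block)
        self.buffers.append(deque(range(block + 1, block + 1 + self.depth)))
        self.stats['miss'] += 1
        return 'miss'

sim = StreamBufferCache(num_buffers=2, depth=4)
results = [sim.access(b) for b in (100, 101, 102, 103, 500, 501, 104)]
print(results)
print(sim.stats)
```

The trace interleaves two sequential streams. Only the first access of each stream goes to memory on demand, and the remaining accesses are served from the heads of the two buffers.&lt;br /&gt;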
&lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching is definitely helpful to improve performance.  On a multiprocessor system prefetching is also useful, but there are tighter constraints on its implementation because data is shared between different processors or cores.  In the message-passing parallel model, each parallel thread has its own memory space and prefetching can be implemented in the same way as for a uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared-memory parallel programming, multiple threads that run on different processors share a common memory space.  If multiple processors use a common cache, then the prefetching implementation is similar to that of a uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but has not used it yet.  At the same time, processor P2 reads D1 into its cache and modifies it, and a cache coherency protocol is used to inform P1 about this change.  D1 is not in P1’s cache, so P1 may simply ignore the notification.  Now, when P1 tries to read D1, it will get stale data from its stream buffer.  One way to prevent this is to improve stream buffers so that they can modify their data just like a cache.  This adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
# The prefetching mechanism can fetch data D1, D2, D3, ..., D10 into P1’s buffer.  Due to parallel processing, P1 only needs to operate on D1 to D5 while P2 operates on the remaining data.  Bandwidth was wasted fetching D6 to D10 into P1 even though P1 did not use them.  There is a trade-off here: if prefetching is very conservative it will lead to misses, and if it is aggressive it will waste bandwidth.&lt;br /&gt;
# In a multiprocessor system, if threads are not bound to a core, the operating system can rebalance the threads across different cores.  This requires the prefetched buffers to be discarded.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
==Prefetching in Intel Core i7&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory:  one for the upstreams that has 12 entries, and one for downstreams that has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
&lt;br /&gt;
==Prefetching in AMD&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
&lt;br /&gt;
AMD contends that this is beneficial when processing large data sets which often process sequential data and process all the data in the stream.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing. &lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to ensure that the prefetcher is fetching at a rate sufficient to continuously provide data to the processor.&lt;br /&gt;
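The behavior described above, a burst of prefetches when a stream is first detected followed by one block per later sequential access, can be illustrated with a toy model. This is not AMD's actual logic: the class name, the detection rule (two consecutive block addresses), and the parameter 'distance' are assumptions, and adaptive prefetching is not modeled.&lt;br /&gt;

```python
class SequentialPrefetcher:
    """Toy unit-stride prefetcher that stays 'distance' blocks ahead."""

    def __init__(self, distance=2):
        self.distance = distance        # preset number of blocks to run ahead
        self.last_block = None          # previous demand access
        self.prefetched_up_to = None    # highest block already requested

    def on_access(self, block):
        requests = []
        sequential = self.last_block is not None and block == self.last_block + 1
        if sequential:
            if self.prefetched_up_to is None:
                start = block + 1       # stream just detected
            else:
                start = self.prefetched_up_to + 1
            target = block + self.distance
            requests = list(range(start, target + 1))
            if requests:
                self.prefetched_up_to = target
        else:
            self.prefetched_up_to = None  # stream broken, start over
        self.last_block = block
        return requests

prefetcher = SequentialPrefetcher(distance=2)
issued = [prefetcher.on_access(b) for b in (10, 11, 12, 13, 50, 51)]
print(issued)
```

With a preset of two, the second sequential access requests two blocks, and each later sequential access requests a single block to keep the same lead over the demand stream.&lt;br /&gt;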
&lt;br /&gt;
=Cache Coherence Support=&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt;Cache coherence (also cache coherency) refers to the consistency of data stored in local caches of a shared resource. In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.&lt;br /&gt;
&lt;br /&gt;
Cache coherence is achieved if the following conditions are met.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and  reads from the same memory location X, then P1 should be returned value A provided no other write happened in-between.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and another processor P2 reads from the same memory location X, then P2 should be returned the value A written by P1 provided no other write happened in-between.  &lt;br /&gt;
* Consider that a processor P1 writes a value A to a memory location X followed by another processor P2 which writes a value B to the same memory location X. In this case the writes should appear in the same order i.e. A and then B.&lt;br /&gt;
&lt;br /&gt;
==Software vs. Hardware solutions==&lt;br /&gt;
&lt;br /&gt;
Both software and hardware solutions exist for cache coherence. Hardware solutions are more widely used than software solutions. Software schemes require support from the cache or runtime system, and in some cases also require hardware assistance. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;ref&amp;gt;[ftp://ftp.cs.wisc.edu/markhill/Papers/isca91_coherence.pdf Comparison of Hardware and Software Cache Coherence Schemes] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon&amp;lt;/ref&amp;gt;Recent studies have shown that software schemes are comparable to hardware solutions. The only cases in which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts or when there are many writes in the program that are not executed at run-time. For relatively well-structured and deterministic programs, on the other hand, software schemes perform in the same range as hardware schemes.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Schemes – Fetch and Replacements==&lt;br /&gt;
&lt;br /&gt;
===Invalidation Schemes vs. Update Strategies&amp;lt;ref&amp;gt;[https://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf Cache Coherence] Josep Torrellas&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
* Invalidation: On a write, all other caches with a copy are invalidated. Invalidation is least suitable when there is a single producer and many consumers of data.&lt;br /&gt;
* Update: On a write, all other caches with a copy are updated. The update strategy is least suitable when one processing element, say PE1, writes the data multiple times before it is read by another processing element, PE2.&lt;br /&gt;
&lt;br /&gt;
The strategies followed for fetch and replacement in some of the schemes are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Snoopy Cache Coherence Schemes===&lt;br /&gt;
#Snooping&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt; is the process where the individual caches monitor address lines for accesses to memory locations that they have cached.  When a write is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.&lt;br /&gt;
#Snoopy Scheme is a distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.&lt;br /&gt;
#When replacement of one of the entries is required the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries.&lt;br /&gt;
&lt;br /&gt;
===Directory Based Cache Coherence Schemes===&lt;br /&gt;
#In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.&lt;br /&gt;
#Fetch and Replacement Scenarios&amp;lt;ref&amp;gt;[http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf Cache Coherence Techniques] Silvia Lametti&amp;lt;/ref&amp;gt;:&lt;br /&gt;
##Block in Uncached State: The block required by a processor P1 is not in the cache. When a processor P1 tries to read an uncached block X, read miss occurs. &lt;br /&gt;
##*Read Miss: P1 is sent the data from memory and becomes the only sharing node. The cached block X is marked as shared.&lt;br /&gt;
##*Write Miss: P1 is sent the data and becomes the sharing node. The block is now marked exclusive to indicate that the only valid copy is the cached one.&lt;br /&gt;
##Block is Shared: The block requested by a Processor P1 is in shared state.&lt;br /&gt;
##*Read Miss:  Requesting Processor P1 is sent the data and P1 is added to the set of processors sharing the data.&lt;br /&gt;
##*Write Miss:  Requesting Processor P1 is sent the value. All processors sharing the data are sent an invalidate message. The block is made exclusive for the Processor P1.&lt;br /&gt;
##Block is Exclusive: The current value of the block (say X) is held in the cache of the processor (the owner) P1 which currently owns the block exclusively.&lt;br /&gt;
##*Read Miss: When a processor P2 raises a fetch request for the block X, P1 is notified that a fetch request has been posted for a block it currently holds. P1 sends the data back to the directory, where it is written to memory and sent on to the requesting processor P2. P2 is now added to the set of processors sharing the block X along with P1.&lt;br /&gt;
##*Data Write-back: The owner processor P1 is replacing the block and hence must write it back. This causes the copy of block X in memory to be up to date. The block will now be uncached and the set of sharing processors will be emptied.&lt;br /&gt;
##*Write Miss: A processor P2 makes write request to the block X which is currently owned by P1. P1 will be notified of this, which causes P1 to update the memory with its value of block X. The updated block X is now sent to P2 and it is made exclusive for P2.&lt;br /&gt;
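The fetch and replacement scenarios listed above can be sketched as a small state machine for one directory entry. The class, method, and message names are illustrative assumptions, and the sketch tracks only directory state and the messages sent, not the data itself.&lt;br /&gt;

```python
class DirectoryEntry:
    """Toy directory entry: 'uncached', 'shared', or 'exclusive'."""

    def __init__(self):
        self.state = 'uncached'
        self.sharers = set()

    def read_miss(self, proc):
        messages = []
        if self.state == 'exclusive':
            owner = next(iter(self.sharers))
            messages.append(('fetch', owner))   # owner writes the block back
        messages.append(('data', proc))
        self.sharers.add(proc)                  # previous owner stays a sharer
        self.state = 'shared'
        return messages

    def write_miss(self, proc):
        messages = []
        if self.state == 'exclusive':
            owner = next(iter(self.sharers))
            messages.append(('fetch_invalidate', owner))
        elif self.state == 'shared':
            for other in sorted(self.sharers - {proc}):
                messages.append(('invalidate', other))
        messages.append(('data', proc))
        self.sharers = {proc}                   # requester is the sole owner
        self.state = 'exclusive'
        return messages

    def write_back(self, proc):
        # Owner replaces the block; memory becomes up to date.
        self.sharers.clear()
        self.state = 'uncached'
        return [('memory_update', proc)]

entry = DirectoryEntry()
log = [entry.read_miss('P1'), entry.read_miss('P2'),
       entry.write_miss('P3'), entry.read_miss('P1')]
print(log)
print(entry.state, sorted(entry.sharers))
```

In the sample run, P1 and P2 first share the block, P3's write miss invalidates both copies, and P1's later read miss forces P3 to supply the data, leaving P1 and P3 as sharers.&lt;br /&gt;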
&lt;br /&gt;
===Write Through Schemes===&lt;br /&gt;
#In write-through schemes, every processor write results in an update of the local cache and a global bus write that updates main memory and also invalidates or updates all other caches holding the same item.&lt;br /&gt;
&lt;br /&gt;
===Write-Back/Ownership Schemes===&lt;br /&gt;
#In write-back/ownership schemes, a single cache has ownership of a block. In this case when a processor writes it will not result in bus writes thus conserving bandwidth. &lt;br /&gt;
&lt;br /&gt;
===Pointer-Based Coherence Schemes&amp;lt;ref&amp;gt;[http://courses.engr.illinois.edu/cs533/reading_list/1c.pdf Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry&amp;lt;/ref&amp;gt;===&lt;br /&gt;
#The Full Bit Vector Schemes&lt;br /&gt;
#*This scheme associates a complete bit vector, one bit per processor, with each block of main memory. The directory also contains a dirty-bit for each memory block to indicate if some processor has been given exclusive access to modify that block in its cache. Each bit indicates whether that memory block is being cached by the corresponding processor, and thus the directory has full knowledge of the processors caching a given block.&lt;br /&gt;
#*When a block has to be invalidated, messages are sent to all processors whose caches have a copy. In terms of message traffic needed to keep the caches coherent, this is the best that an invalidation-based directory scheme can do.&lt;br /&gt;
#Limited Pointer Schemes&lt;br /&gt;
#*Recent studies have shown that for most kinds of data objects the corresponding memory locations are cached by only a small number of processors at any given time. This knowledge can be exploited to reduce directory memory overhead by restricting each directory entry to a small fixed number of pointers, each pointing to a processor caching that memory block.&lt;br /&gt;
#*An important implication of limited pointer schemes is that there must exist some mechanism to handle blocks that are cached by more processors than the number of pointers in the directory entry.&lt;br /&gt;
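The directory memory saving can be illustrated with a rough per-block bit count. The function names are hypothetical, and the limited pointer count assumes each pointer needs just enough bits to name one processor plus the single dirty bit, ignoring any extra state an implementation would need to handle pointer overflow.&lt;br /&gt;

```python
def full_bit_vector_bits(num_processors):
    # One presence bit per processor plus the dirty bit.
    return num_processors + 1

def limited_pointer_bits(num_processors, num_pointers):
    # Bits needed to name one processor (ceiling of log2).
    pointer_width = (num_processors - 1).bit_length()
    return num_pointers * pointer_width + 1

# Hypothetical machine sizes, with four pointers per directory entry.
sizes = (16, 64, 256, 1024)
table = [(p, full_bit_vector_bits(p), limited_pointer_bits(p, 4)) for p in sizes]
for row in table:
    print(row)
```

Under these assumptions the two schemes cost about the same at 16 processors, while at 1024 processors the full bit vector needs 1025 bits per block against 41 for four pointers.&lt;br /&gt;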
&lt;br /&gt;
==Cache Coherency protocol==&lt;br /&gt;
A '''coherency protocol''' is a protocol which maintains the consistency between all the caches in a system of [[distributed shared memory]]. The protocol maintains [[memory coherence]] according to a specific [[consistency model]]. Older multiprocessors support the [[sequential consistency]] model, while modern shared memory systems typically support the [[release consistency]] or [[weak consistency]] models.&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
&lt;br /&gt;
Various models and protocols have been devised for maintaining coherence, such as [[MSI_protocol|MSI]], [[MESI_protocol|MESI]] (aka Illinois), [[MOSI_protocol|MOSI]], [[MOESI_protocol|MOESI]], [[MERSI_protocol|MERSI]], [[MESIF_protocol|MESIF]], [[Write-once (cache coherence)|write-once]], [[Synapse protocol|Synapse]], [[Berkeley protocol|Berkeley]], [[Firefly protocol|Firefly]], and [[Dragon protocol]].&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology- Lectures slides by Prof.Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel computer architecture: a hardware/software approach” by David. E. Culler, Jaswinder Pal Singh, Anoop Gupta  (pg 887)    &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler.   (ref#2) &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Other References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73596</id>
		<title>CSC/ECE 506 Spring 2013/6a cs</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/6a_cs&amp;diff=73596"/>
		<updated>2013-02-24T07:50:01Z</updated>

		<summary type="html">&lt;p&gt;Scanjee: /* Pointer-Based Coherence Schemes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Cache Hierarchy=&lt;br /&gt;
[[Image: memchart.jpg|thumbnail|300px|right|Memory Hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
In a simple computer model, the processor reads data and instructions from memory and operates on the data.  The operating frequency of CPUs has increased faster than the speed of memory and memory interconnects.  For example, cores in Intel first-generation i7 processors run at 3.2 GHz, while the memory runs at only 1.3 GHz.  Multi-core architectures also put more demand on memory bandwidth.  This increases memory access latency, and the CPU has to be idle for most of the time.  As a result, memory became a performance bottleneck.   &lt;br /&gt;
&lt;br /&gt;
To solve this problem, the “cache” was invented.  A cache is simply temporary volatile storage, like primary memory, but it runs at a speed close to the core frequency.  The CPU can access data and instructions from the cache in a few clock cycles, while accessing data from main memory can take more than 50 cycles.  In the early days of computing, the cache was implemented as a stand-alone chip outside the processor.  In today’s processors, the cache is implemented on the same die as the core.  &lt;br /&gt;
&lt;br /&gt;
There can be multiple levels of cache, each one farther from the core and larger in size.  L1 is closest to the CPU and, as a result, fastest to access.  Next to L1 is the L2 cache and then L3.  The L1 cache is divided into an instruction cache and a data cache. This is better than having a combined larger cache because the instruction cache, being read-only, is easy to implement, while the data cache is read-write.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds that are advancing at a more rapid pace.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16-64KB and provides access in 2-4 cycles.  L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles.  L3 typically ranges from 4MB to 32MB and provides access to the data it contains within 30-50 cycles.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
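To see why the extra levels pay off, the average access time of a hierarchy can be estimated from per-level latencies and hit rates. The latencies below fall within the ranges quoted above, but the hit rates and the 100-cycle memory latency are assumed values chosen only for illustration.&lt;br /&gt;

```python
def average_access_time(levels, memory_latency):
    # levels: list of (latency_in_cycles, hit_rate), ordered from L1 outward.
    total = 0.0
    reach = 1.0   # fraction of accesses that get as far as this level
    for latency, hit_rate in levels:
        total += reach * latency
        reach *= 1.0 - hit_rate
    # Whatever misses in every level pays the main memory latency.
    return total + reach * memory_latency

# Assumed hit rates: 90% in L1, 80% in L2, 50% in L3.
three_level = [(3, 0.9), (10, 0.8), (40, 0.5)]
l1_only = [(3, 0.9)]
print(average_access_time(three_level, memory_latency=100))
print(average_access_time(l1_only, memory_latency=100))
```

Under these assumptions, the three-level hierarchy averages roughly 6 cycles per access, compared with about 13 cycles when every L1 miss goes straight to memory.&lt;br /&gt;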
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management by requiring it to consider all processes running concurrently in the cache coherency, replenishment, fetching, and other management functions.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors and others private to each processor.  Typically, lower-level caches are private and high-level caches may or may not be shared.   Notice that none of the examples in the figures below have a shared L1 cache.  Cache management functions must consider both shared and private caches when reading and writing data from memory.&lt;br /&gt;
[[Image:Uniprocessor_memory_hierarchy.jpg|thumbnail|200px|Uniprocessor memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[Image:Dualcore_memory_hierarchy.jpg|thumbnail|200px|Dualcore memory hierarchy&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
{| border='1' class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center&amp;quot;&lt;br /&gt;
|+style=&amp;quot;white-space:nowrap&amp;quot;|Table 1: Cache on different Microprocessors&lt;br /&gt;
|-&lt;br /&gt;
! Company &amp;amp; Processor !! # cores !! L1 cache Per core !! L2 cache !! L3 cache  !! Year&lt;br /&gt;
|-&lt;br /&gt;
| Intel Pentium Dual Core || 2 || I:32KB   D:32KB || 1MB 8 way set assoc.  || -  ||  2006&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Clovertown || 2 || I:4*32KB   D:4*32KB || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon 3200 series || 4 || - || 2*4MB || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64FX || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 1MB 16way set assoc. || - || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Athlon 64X2 || 2 || I:64KB     D:4KB  Both 2-way set assoc. || 512KB/1MB 16way set assoc. || 2MB || Jan 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Barcelona || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Aug 2007&lt;br /&gt;
|-&lt;br /&gt;
| Sun Microsystems Ultra Sparc T2 || 8 ||  I:16KB   D:8KB || 4MB (8 banks) 16way set assoc. || - || Oct 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Wolfdale DP || 2 ||  D:96KB  || 6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| Intel Xeon Harpertown || 4 ||  D:96KB  || 2*6MB || - || Nov 2007&lt;br /&gt;
|-&lt;br /&gt;
| AMD Phenom || 4 || I:64KB   D:64KB  || 512KB || 2MB Shared || Nov 2007    Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Duo || 2 ||  I:32KB   D:32KB  || 2/4MB 8 way set assoc. || - ||  2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Penryn Wolfdale DP || 4 ||  -  || 6-12MB || 6MB || Mar 2008     Aug 2008&lt;br /&gt;
|-&lt;br /&gt;
| Intel Core 2 Quad Yorkfield || 4 ||  D:96KB  || 12MB  || - ||  Mar 2008&lt;br /&gt;
|-&lt;br /&gt;
| AMD Toliman || 3K10 || I:64KB   D:64KB  || 512KB || 2MB Shared || Apr 2008&lt;br /&gt;
|-&lt;br /&gt;
| Azul Systems Vega3 7300 Series || 864 || 768GB  || - || - || May 2008&lt;br /&gt;
|-&lt;br /&gt;
| IBM RoadRunner || 8+1 || 32KB  || 512KB || - || Jun 2008&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies=&lt;br /&gt;
&lt;br /&gt;
In section 6.2.3&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory. As a review, write-through writes data to both the cache and memory on every write. Write-back writes to the cache first and to memory only when the block is flushed.&lt;br /&gt;
The write miss policies covered in the text&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
==Write hit policies==&lt;br /&gt;
[[Image:Cache_policy_comparison.jpg|thumbnail|600px|Cache policy comparison&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
*Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Easy to implement&lt;br /&gt;
***Main memory has most recent copy of the data&lt;br /&gt;
***Read misses never result in writes to main memory&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Every write needs to access main memory&lt;br /&gt;
***Bandwidth intensive&lt;br /&gt;
***Writes are slower&lt;br /&gt;
*Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.&lt;br /&gt;
**Advantages:&lt;br /&gt;
***Writes are as fast as the speed of the cache memory&lt;br /&gt;
***Multiple writes to a block require one write to main memory&lt;br /&gt;
***Less bandwidth intensive&lt;br /&gt;
**Disadvantages:&lt;br /&gt;
***Harder to implement&lt;br /&gt;
***Main memory may not be consistent with cache&lt;br /&gt;
***Reads that result in data replacement may cause dirty blocks to be written to main memory&lt;br /&gt;
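&lt;br /&gt;
The bandwidth difference between the two write hit policies can be illustrated with a minimal sketch. It assumes a hypothetical cache with a single line, a write-allocate miss policy, and a trace that contains only writes; the function name and the trace are illustrative and do not come from the cited sources.&lt;br /&gt;

```python
def memory_writes(block_sequence, policy):
    # Count main-memory writes for a one-line cache that receives only writes.
    cached_block = None   # block currently held in the single cache line
    dirty = False         # True when the cached block differs from memory
    count = 0
    for block in block_sequence:
        if block != cached_block:
            # The line is being replaced: write-back flushes a dirty victim.
            if policy == 'write-back' and dirty:
                count += 1
            cached_block = block
            dirty = False
        if policy == 'write-through':
            count += 1    # every write is also sent to main memory
        else:
            dirty = True  # write-back defers the memory update
    return count


trace = ['A', 'A', 'A', 'B', 'B', 'A']
through = memory_writes(trace, 'write-through')
back = memory_writes(trace, 'write-back')
print(through, back)
```

In this trace, write-through sends all six writes to memory, while write-back writes to memory only when a dirty block is displaced. The final dirty block is still waiting in the cache when the trace ends.&lt;br /&gt;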
&lt;br /&gt;
==Write miss policies==&lt;br /&gt;
Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus. &lt;br /&gt;
&lt;br /&gt;
*Write-allocate vs no-write-allocate: When a write misses in the cache, a line may or may not be allocated in the cache for the block.  With write-allocate, a line is allocated for the written data.  This policy is typically associated with write-back caches.  With no-write-allocate, no line is allocated.&lt;br /&gt;
*Fetch-on-write vs no-fetch-on-write: Fetch-on-write causes the block of data to be fetched from a lower level of the memory hierarchy on every write miss.  Fetch-on-write determines whether the block is fetched from memory, and write-allocate determines whether the memory block is stored in a cache line.  Certain write policies may allocate a line in the cache for data written by the CPU without retrieving the block from memory first.&lt;br /&gt;
*Write-before-hit vs no-write-before-hit:  The write-before-hit will write data to the cache before checking the cache tags for a match.  In case of a miss, the policy will displace the block of data already in the cache.&lt;br /&gt;
&lt;br /&gt;
==Combination Policies==&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram. They are considered unproductive since the data is fetched from lower level memory but is not stored in the cache. Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low.&lt;br /&gt;
&lt;br /&gt;
* Write-validate: It is a combination of no-fetch-on-write and write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  The policy allows partial lines to be written to the cache on a miss.  It provides better performance, works with machines that have various line sizes, and does not add instruction execution overhead to the program being run.  Write-validate requires that the lower level system memory can support writes of partial lines. No-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the processor is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
* Write-around:  Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.  This policy uses a non-blocking write scheme to write to cache.  It writes data to the next lower cache without modifying the data of the cache line. It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that write-around performs better than write-validate in only a few cases. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from lower-level memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst, as the reductions quoted above show. The author notes, though, that it does perform better than fetch-on-write and is easy to implement.&lt;br /&gt;
&lt;br /&gt;
* Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the base-line for determining the performance of the other three policy combinations.   In general, all three previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
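&lt;br /&gt;
The way the three write miss choices combine into the named policies can be summarized in a short sketch. The function name and the label 'unproductive' are illustrative assumptions; the mapping itself follows the definitions given in this section.&lt;br /&gt;

```python
from collections import Counter
from itertools import product


def classify(fetch_on_write, write_allocate, write_before_hit):
    # Map one combination of the three write miss choices to a policy name.
    if fetch_on_write:
        # Fetching a block without allocating a line for it wastes the fetch.
        return 'fetch-on-write' if write_allocate else 'unproductive'
    if write_allocate:
        return 'write-validate'
    return 'write-invalidate' if write_before_hit else 'write-around'


# Enumerate all eight combinations and count how many map to each name.
names = Counter(classify(*combo) for combo in product([True, False], repeat=3))
for name, count in sorted(names.items()):
    print(name, count)
```

Two of the eight combinations fall into the unproductive case, matching the two blocked-out entries mentioned above.&lt;br /&gt;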
&lt;br /&gt;
=Prefetching=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. A cache is very efficient in terms of access time once the data or instructions are in the cache.  But when a process tries to access something that is not already in the cache, a cache miss occurs and that data needs to be brought into the cache from memory.  Cache misses are generally expensive because the processor has to wait for the data (in parallel processing, a processor can execute other tasks while it is waiting on data, but there will be some overhead for this).  Prefetching brings data into the cache before the program needs it; in other words, it is a way to reduce cache misses.  Prefetching uses some type of prediction mechanism to anticipate the next data that will be needed and brings it into the cache.  It is not guaranteed that the prefetched data will be used.  The goal is to reduce cache misses and thereby improve overall performance.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Some architectures have instructions to prefetch data into the cache.  Programmers and compilers can insert these prefetch instructions in the code.  This is known as software prefetching.  In hardware prefetching, the processor observes the system behavior and issues prefetch requests itself.  The Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch.   Graphics Processing Units benefit from prefetching due to the spatial locality of their data&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.   &lt;br /&gt;
&lt;br /&gt;
==Advantages==&lt;br /&gt;
*Improves overall performance by reducing cache misses.&lt;br /&gt;
==Disadvantages==&lt;br /&gt;
* Wastes bandwidth when prefetched data is not used.&lt;br /&gt;
* Hardware prefetching requires a more complex architecture.  A second-order effect is the cost of implementing it in silicon and validating it.&lt;br /&gt;
* Software prefetching adds additional instructions to the program, making the program larger.&lt;br /&gt;
* If the same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted.  If the evicted blocks are needed later, they will generate cache misses.  This can be prevented by having a separate cache for prefetching, but that comes with hardware costs.&lt;br /&gt;
* When the scheduler changes the task running on a processor, prefetched data may become useless.&lt;br /&gt;
&lt;br /&gt;
==Effectiveness==&lt;br /&gt;
Prefetching effectiveness can be tracked by the following metrics:&lt;br /&gt;
# Coverage is defined as the fraction of original cache misses that were prefetched, resulting in a cache hit.&lt;br /&gt;
# Accuracy is defined as the fraction of prefetches that are useful.&lt;br /&gt;
# Timeliness measures how early the prefetches arrive.&lt;br /&gt;
Ideally, a system should have high coverage, high accuracy and optimum timeliness.  Realistically, aggressive prefetching can increase coverage but decrease accuracy, and vice versa.  Also, if prefetching is done too early, the fetched data may have to be evicted without being used, and if it is done too late, the access can still result in a cache miss.&lt;br /&gt;
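&lt;br /&gt;
Coverage and accuracy, as defined above, can be computed from three counters. The sketch below uses made-up counter values and a hypothetical function name purely for illustration. Timeliness is not modeled.&lt;br /&gt;

```python
def prefetch_metrics(baseline_misses, useful_prefetches, total_prefetches):
    # baseline_misses: misses observed without prefetching
    # useful_prefetches: prefetched blocks that were later used (misses removed)
    # total_prefetches: all prefetch requests issued
    coverage = useful_prefetches / baseline_misses
    accuracy = useful_prefetches / total_prefetches
    return coverage, accuracy


# Illustrative numbers only: 1000 original misses, 800 prefetches, 600 useful.
coverage, accuracy = prefetch_metrics(1000, 600, 800)
print(coverage, accuracy)
```

With these illustrative counters, 60% of the original misses are covered and 75% of the issued prefetches are useful. Issuing more prefetches would tend to raise the first figure and lower the second.&lt;br /&gt;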
&lt;br /&gt;
==Stream Buffer Prefetching==&lt;br /&gt;
This is a prefetching technique that uses a FIFO buffer in which each entry is a cache line with an address (or tag) and an available bit.  The system prefetches a stream of sequential data into a stream buffer, and multiple stream buffers can be used to prefetch multiple streams in parallel.  On a cache access, the head entries of the stream buffers are checked for a match along with the cache.  If the block is not found in the cache but is found at the head of a stream buffer, it is moved to the cache and the next entry in the buffer becomes the head.  If the block is found neither in the cache nor at the head of a buffer, the data is brought from memory into the cache and the subsequent address is assigned to a new stream buffer.   Only the heads of the stream buffers are checked during a cache access, not the whole buffer.  Checking all the entries in all the buffers would increase hardware complexity.&lt;br /&gt;
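&lt;br /&gt;
The lookup sequence described above can be sketched as a small simulation. Several simplifications are assumptions of this sketch and not details from the cited paper: the cache is unbounded, blocks are identified by integers, a buffer refills itself instantly, and buffers are reallocated in round-robin order. The class and function names are hypothetical.&lt;br /&gt;

```python
from collections import deque


class StreamBuffer:
    # FIFO of prefetched sequential block numbers; only the head is compared.
    def __init__(self, depth):
        self.depth = depth
        self.entries = deque()

    def allocate(self, start_block):
        # Restart the stream at start_block and prefetch depth blocks.
        self.entries = deque(start_block + i for i in range(self.depth))

    def head(self):
        return self.entries[0] if self.entries else None

    def pop_head(self):
        # Hand the head block to the cache and prefetch one more block.
        block = self.entries.popleft()
        self.entries.append(block + self.depth)
        return block


def simulate(trace, num_buffers=2, depth=4):
    cache = set()
    buffers = [StreamBuffer(depth) for _ in range(num_buffers)]
    next_victim = 0
    stats = {'cache_hit': 0, 'buffer_hit': 0, 'miss': 0}
    for block in trace:
        if block in cache:
            stats['cache_hit'] += 1
            continue
        hit_buffer = next((b for b in buffers if b.head() == block), None)
        if hit_buffer is not None:
            cache.add(hit_buffer.pop_head())
            stats['buffer_hit'] += 1
        else:
            # Miss everywhere: fetch the block and start a stream after it.
            cache.add(block)
            buffers[next_victim].allocate(block + 1)
            next_victim = (next_victim + 1) % num_buffers
            stats['miss'] += 1
    return stats


stats = simulate([100, 101, 102, 103, 500, 501, 104, 100])
print(stats)
```

In this trace, two interleaved sequential streams are served by two buffers, so only the first access of each stream goes to memory as a demand miss.&lt;br /&gt;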
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Cache_hit_improvements.jpg|center]]&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
The plot above shows the cache hit improvement as a function of the number of stream buffers for different programs.  The graph on the left compares 8 programs from the NAS suite, while the graph on the right shows programs from the Unix unity suite&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Prefetching in Parallel Computing==&lt;br /&gt;
On a uniprocessor system, prefetching is definitely helpful for improving performance.  On a multiprocessor system prefetching is also useful, but there are tighter constraints on its implementation because data will be shared between different processors or cores.  In the message passing parallel model, each parallel thread has its own memory space, and prefetching can be implemented in the same way as for a uniprocessor system.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
In shared memory parallel programming, multiple threads that run on different processors share a common memory space.  If multiple processors use a common cache, then the prefetching implementation is similar to that of a uniprocessor system.  Difficulties arise when each core has its own cache.  Some of the scenarios that can occur are:&lt;br /&gt;
# Processor P1 has prefetched some data D1 into its stream buffer but has not used it yet.  At the same time, processor P2 reads D1 into its cache and modifies it, and one of the cache coherency protocols is used to inform processor P1 about this change.  D1 is not in P1’s cache, so P1 may simply ignore the notification.  Now, when P1 tries to read D1, it will get the stale data from its stream buffer.  One way to prevent this is to improve stream buffers so that they can modify their data just like a cache.  This adds complexity to the architecture and increases cost&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
# The prefetching mechanism can fetch data D1, D2, D3, …, D10 into P1’s buffer.  Due to parallel processing, P1 only needs to operate on D1 to D5 while P2 operates on the remaining data.  Some bandwidth was wasted in fetching D6 to D10 into P1 even though P1 did not use them.  There is a trade-off to be made here: if prefetching is very conservative it will lead to misses, and if it is not it will waste bandwidth.&lt;br /&gt;
# In a multiprocessor system, if threads are not bound to a core, the operating system can rebalance the threads across different cores.  This requires the prefetched buffers to be discarded.&lt;br /&gt;
&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
Following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
==Prefetching in Intel Core i7&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory:  one for upstreams that has 12 entries, and one for downstreams that has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
&lt;br /&gt;
==Prefetching in AMD&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
&lt;br /&gt;
AMD contends that this is beneficial when processing large data sets which often process sequential data and process all the data in the stream.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing. &lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to ensure that the prefetcher is fetching at a rate sufficient to continuously provide data to the processor.&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Support=&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt;Cache coherence (also cache coherency) refers to the consistency of data stored in local caches of a shared resource. In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.&lt;br /&gt;
&lt;br /&gt;
Cache coherence is achieved if the following conditions are met.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and  reads from the same memory location X, then P1 should be returned value A provided no other write happened in-between.&lt;br /&gt;
* If a processor P1 writes a value A to a memory location X and another processor P2 reads from the same memory location X, then P2 should be returned the value A written by P1 provided no other write happened in-between.  &lt;br /&gt;
* Consider that a processor P1 writes a value A to a memory location X followed by another processor P2 which writes a value B to the same memory location X. In this case the writes should appear in the same order i.e. A and then B.&lt;br /&gt;
&lt;br /&gt;
==Software vs. Hardware solutions==&lt;br /&gt;
&lt;br /&gt;
Both software and hardware solutions exist for cache coherence. Hardware solutions are more widely used than software solutions. Software schemes require support from the compiler or runtime system, and in some cases also require hardware assistance. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;ref&amp;gt;[ftp://ftp.cs.wisc.edu/markhill/Papers/isca91_coherence.pdf Comparison of Hardware and Software Cache Coherence Schemes] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon&amp;lt;/ref&amp;gt;Recent studies have shown that software schemes are comparable to hardware solutions. The only cases in which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts, or when there are many writes in the program that are not executed at run-time. For relatively well structured and deterministic programs, on the other hand, software schemes perform in the same range as hardware schemes.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Schemes – Fetch and Replacements==&lt;br /&gt;
&lt;br /&gt;
===Invalidation Schemes vs. Update Strategies&amp;lt;ref&amp;gt;[https://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf Cache Coherence] Josep Torrellas&amp;lt;/ref&amp;gt;===&lt;br /&gt;
&lt;br /&gt;
* Invalidation: On a write, all other caches with a copy are invalidated. Invalidation is least recommended when there is a single producer and many consumers of data.&lt;br /&gt;
* Update: On a write, all other caches with a copy are updated. The update strategy is least recommended when there are multiple writes by one processing element, say PE1, before the data is read by another processing element, PE2.&lt;br /&gt;
&lt;br /&gt;
The strategies followed for fetch and replacement in some of the schemes are discussed below.&lt;br /&gt;
&lt;br /&gt;
===Snoopy Cache Coherence Schemes===&lt;br /&gt;
#Snooping&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]&amp;lt;/ref&amp;gt; is the process where the individual caches monitor address lines for accesses to memory locations that they have cached.  When a write is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.&lt;br /&gt;
#Snoopy Scheme is a distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.&lt;br /&gt;
#When replacement of one of the entries is required the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries.&lt;br /&gt;
&lt;br /&gt;
===Directory Based Cache Coherence Schemes===&lt;br /&gt;
#In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.&lt;br /&gt;
#Fetch and Replacement Scenarios&amp;lt;ref&amp;gt;[http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf Cache Coherence Techniques] Silvia Lametti&amp;lt;/ref&amp;gt;:&lt;br /&gt;
##Block in Uncached State: The block required by a processor P1 is not in any cache. When P1 tries to read an uncached block X, a read miss occurs. &lt;br /&gt;
##*Read Miss: P1 is sent the data from memory and is made the only sharing node. The block X that is cached is marked as shared.&lt;br /&gt;
##*Write Miss: P1 is sent the data and is also made the sharing node. The block is now marked exclusive to indicate that the only valid copy is cached.&lt;br /&gt;
##Block is Shared: The block requested by a processor P1 is in the shared state.&lt;br /&gt;
##*Read Miss:  The requesting processor P1 is sent the data and P1 is added to the set of processors sharing the data.&lt;br /&gt;
##*Write Miss:  The requesting processor P1 is sent the value. All processors sharing the data are sent an invalidate message. The block is made exclusive to processor P1.&lt;br /&gt;
##Block is Exclusive: The current value of the block (say X) is held in the cache of the processor P1 (the owner), which currently owns the block exclusively.&lt;br /&gt;
##*Read Miss: When a processor P2 raises a fetch request for the block X, P1 is notified that a fetch request has been posted for a block it currently holds. P1 sends the data back to the directory, where it is written to memory and sent on to the requesting processor P2. P2 is now added to the set of processors sharing the block X, along with P1.&lt;br /&gt;
##*Data Write-back: The owner processor P1 is replacing the block and hence must write it back. This causes the copy of block X in memory to be up to date. The block is now uncached and the set of sharing processors is emptied.&lt;br /&gt;
##*Write Miss: A processor P2 makes a write request to the block X, which is currently owned by P1. P1 is notified of this, which causes P1 to update memory with its value of block X. The updated block X is then sent to P2 and made exclusive to P2.&lt;br /&gt;
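&lt;br /&gt;
The fetch and replacement scenarios above can be sketched as a small state machine for one directory entry. This is a simplified model under stated assumptions: the class and method names are hypothetical, processors are identified by strings, and messages are represented only by the return values (the owner that must supply the data, or the set of processors to invalidate).&lt;br /&gt;

```python
class DirectoryEntry:
    # Directory state for a single memory block.
    def __init__(self):
        self.state = 'uncached'
        self.sharers = set()

    def read_miss(self, requester):
        # When the block is exclusive, the owner must first supply the data.
        owner = next(iter(self.sharers)) if self.state == 'exclusive' else None
        self.sharers.add(requester)
        self.state = 'shared'
        return owner

    def write_miss(self, requester):
        # Every other holder loses its copy; the requester becomes the owner.
        invalidated = self.sharers - {requester}
        self.sharers = {requester}
        self.state = 'exclusive'
        return invalidated

    def write_back(self):
        # The owner replaces the block, so memory is up to date again.
        self.sharers = set()
        self.state = 'uncached'


entry = DirectoryEntry()
entry.read_miss('P1')                # uncached to shared, sharers are P1
entry.read_miss('P2')                # shared, sharers are P1 and P2
invalidated = entry.write_miss('P1') # P2 is invalidated, P1 owns the block
supplier = entry.read_miss('P2')     # P1 supplies the data, block is shared
print(invalidated, supplier, entry.state, sorted(entry.sharers))
```

The sketch keeps the owner in the sharer set after a read miss on an exclusive block, which mirrors the description above where P2 is added to the set along with P1.&lt;br /&gt;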
&lt;br /&gt;
===Write Through Schemes===&lt;br /&gt;
#In write-through schemes, every processor write results in an update of the local cache and a global bus write that updates main memory and also invalidates or updates all other caches holding that same item.&lt;br /&gt;
&lt;br /&gt;
===Write-Back/Ownership Schemes===&lt;br /&gt;
#In write-back/ownership schemes, a single cache has ownership of a block. In this case when a processor writes it will not result in bus writes thus conserving bandwidth. &lt;br /&gt;
&lt;br /&gt;
===Pointer-Based Coherence Schemes&amp;lt;ref&amp;gt;[http://courses.engr.illinois.edu/cs533/reading_list/1c.pdf Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry&amp;lt;/ref&amp;gt;===&lt;br /&gt;
#The Full Bit Vector Schemes&lt;br /&gt;
#*This scheme associates a complete bit vector, one bit per processor, with each block of main memory. The directory also contains a dirty-bit for each memory block to indicate if some processor has been given exclusive access to modify that block in its cache. Each bit indicates whether that memory block is being cached by the corresponding processor, and thus the directory has full knowledge of the processors caching a given block.&lt;br /&gt;
#*When a block has to be invalidated, messages are sent to all processors whose caches have a copy. In terms of message traffic needed to keep the caches coherent, this is the best that an invalidation-based directory scheme can do.&lt;br /&gt;
#Limited Pointer Schemes&lt;br /&gt;
#*Recent studies have shown that for most kinds of data objects the corresponding memory locations are cached by only a small number of processors at any given time. This knowledge can be exploited to reduce directory memory overhead by restricting each directory entry to a small fixed number of pointers, each pointing to a processor caching that memory block.&lt;br /&gt;
#*An important implication of limited pointer schemes is that there must exist some mechanism to handle blocks that are cached by more processors than the number of pointers in the directory entry.&lt;br /&gt;
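&lt;br /&gt;
The directory memory saving can be estimated with a back-of-the-envelope sketch. It assumes one dirty bit per entry in both schemes and pointers that are just wide enough to name any processor. The function names and the processor counts are illustrative and are not figures from the cited paper.&lt;br /&gt;

```python
def full_bit_vector_bits(num_processors):
    # One presence bit per processor plus one dirty bit.
    return num_processors + 1


def limited_pointer_bits(num_processors, num_pointers):
    # Each pointer needs ceil(log2(num_processors)) bits; add one dirty bit.
    pointer_width = (num_processors - 1).bit_length()
    return num_pointers * pointer_width + 1


for processors in (64, 1024):
    full = full_bit_vector_bits(processors)
    limited = limited_pointer_bits(processors, 4)
    print(processors, full, limited)
```

Under these assumptions, the full bit vector grows linearly with the number of processors, while an entry with four pointers grows only with the logarithm of that number.&lt;br /&gt;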
&lt;br /&gt;
==Cache Coherency protocol==&lt;br /&gt;
A '''coherency protocol''' is a protocol which maintains the consistency between all the caches in a system of [[distributed shared memory]]. The protocol maintains [[memory coherence]] according to a specific [[consistency model]]. Older multiprocessors support the [[sequential consistency]] model, while modern shared memory systems typically support the [[release consistency]] or [[weak consistency]] models.&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
&lt;br /&gt;
Various models and protocols have been devised for maintaining coherence, such as [[MSI_protocol|MSI]], [[MESI_protocol|MESI]] (aka Illinois), [[MOSI_protocol|MOSI]], [[MOESI_protocol|MOESI]], [[MERSI_protocol|MERSI]], [[MESIF_protocol|MESIF]], [[Write-once (cache coherence)|write-once]], [[Synapse protocol|Synapse]], [[Berkeley protocol|Berkeley]], [[Firefly protocol|Firefly]], and [[Dragon protocol]].&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; Computer Design &amp;amp; Technology- Lectures slides by Prof.Eric Rotenberg  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; “Parallel computer architecture: a hardware/software approach” by David. E. Culler, Jaswinder Pal Singh, Anoop Gupta  (pg 887)    &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler.   (ref#2) &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/Instruction_prefetch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.real-knowledge.com/memory.htm  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Other References=&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Scanjee</name></author>
	</entry>
</feed>