CSC ECE 506 Spring 2015 6a vs - Revision history

Vgarg2: /* Cache Coherence Support */

2015-03-03T01:18:27Z

Cache Coherence Support

Vgarg2: /* Prefetching */

2015-03-03T00:44:39Z

Prefetching

Vgarg2: /* Combination Policies */

2015-03-03T00:28:36Z

Combination Policies

← Older revision		Revision as of 00:28, 3 March 2015
Line 179:		Line 179:
	* Write-validate: It is a combination of no-fetch-on-write and write-allocate.<sup><span id="4body">[[#4foot\|[4]]]</span></sup> The policy allows partial lines to be written to the cache on a miss. It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run. Write-validate requires that the L1 system memory can support writes of partial lines. The no-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the process is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.		* Write-validate: It is a combination of no-fetch-on-write and write-allocate.<sup><span id="4body">[[#4foot\|[4]]]</span></sup> The policy allows partial lines to be written to the cache on a miss. It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run. Write-validate requires that the L1 system memory can support writes of partial lines. The no-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the process is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.
	* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate.<sup><span id="4body">[[#4foot\|[4]]]</span></sup> This policy invalidates lines when there is a miss. With this policy, the copy that exists in L1 memory after the write miss differs from the one in the cache. For write hits the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.		* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate.<sup><span id="4body">[[#4foot\|[4]]]</span></sup> This policy invalidates lines when there is a miss. With this policy, the copy that exists in L1 memory after the write miss differs from the one in the cache. For write hits the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.
	* Write-around: ~~Combination~~ of no-fetch-on-write, no-write-allocate, and no-write-before-hit.<sup><span id="4body">[[#4foot\|[4]]]</span></sup> This policy uses a non-blocking write scheme to write to cache. It writes data to the next lower cache without modifying the data of the cache line. It bypasses the L2 cache, writes directly to the L1 memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that in only but a few cases write-around performs worse than write-validate policies. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from L1 memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.		* Write-around: This policy is a combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit.<sup><span id="4body">[[#4foot\|[4]]]</span></sup> This policy uses a non-blocking write scheme to write to cache. It writes data to the next lower cache without modifying the data of the cache line. It bypasses the L2 cache, writes directly to the L1 memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that in only but a few cases write-around performs worse than write-validate policies. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from L1 memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.

	Of these three combinations, write-validate performed the worse. The author notes that it does perform better than fetch-on-write and is easy to implement.		Of these three combinations, write-validate performed the worse. The author notes that it does perform better than fetch-on-write and is easy to implement.

Vgarg2: /* Combination Policies */

2015-03-03T00:27:39Z

Combination Policies

← Older revision		Revision as of 00:27, 3 March 2015
Line 177:		Line 177:
	Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low.		Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low.

	* Write-validate: It is a combination of no-fetch-on-write and write-allocate.<sup><span id="4body">[[#4foot\|[4]]]</span></sup> The policy allows partial lines to be written to the cache on a miss. It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run. Write-validate requires that the L1 system memory can support writes of partial lines. The no-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model~~, though,~~ a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the process is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.		* Write-validate: It is a combination of no-fetch-on-write and write-allocate.<sup><span id="4body">[[#4foot\|[4]]]</span></sup> The policy allows partial lines to be written to the cache on a miss. It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run. Write-validate requires that the L1 system memory can support writes of partial lines. The no-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the process is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.
	* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate.<sup><span id="4body">[[#4foot\|[4]]]</span></sup> This policy invalidates lines when there is a miss. With this policy, the copy that exists in L1 memory after the write miss differs from the one in the cache. For write hits~~, though,~~ the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.		* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate.<sup><span id="4body">[[#4foot\|[4]]]</span></sup> This policy invalidates lines when there is a miss. With this policy, the copy that exists in L1 memory after the write miss differs from the one in the cache. For write hits the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.
	* Write-around: Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit.<sup><span id="4body">[[#4foot\|[4]]]</span></sup> This policy uses a non-blocking write scheme to write to cache. It writes data to the next lower cache without modifying the data of the cache line. It bypasses the L2 cache, writes directly to the L1 memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that in only but a few cases write-around performs worse than write-validate policies. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from L1 memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.		* Write-around: Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit.<sup><span id="4body">[[#4foot\|[4]]]</span></sup> This policy uses a non-blocking write scheme to write to cache. It writes data to the next lower cache without modifying the data of the cache line. It bypasses the L2 cache, writes directly to the L1 memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that in only but a few cases write-around performs worse than write-validate policies. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from L1 memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.

	Of these three combinations, write-validate performed the worse. The author notes~~, though,~~ that it does perform better than fetch-on-write and is easy to implement.		Of these three combinations, write-validate performed the worse. The author notes that it does perform better than fetch-on-write and is easy to implement.

	* Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected. This policy was used as the base-line for determining the performance of the other three policy combinations. In general, all three previous combinations exhibited fewer misses than this policy.		* Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected. This policy was used as the base-line for determining the performance of the other three policy combinations. In general, all three previous combinations exhibited fewer misses than this policy.

Vgarg2: /* Write miss policies */

2015-03-03T00:23:25Z

Write miss policies

← Older revision		Revision as of 00:23, 3 March 2015
Line 169:		Line 169:
	Write miss policies can help eliminate write misses and therefore reduce bus traffic. This reduces the wait-time of all processors sharing the bus.		Write miss policies can help eliminate write misses and therefore reduce bus traffic. This reduces the wait-time of all processors sharing the bus.

	*Write-allocate vs Write no-allocate: When a write ~~misses~~ in the cache, there may or may not be a line in the cache allocated to the block. For write-allocate, there will be a line in the cache for the written data. This policy is typically associated with write-back caches. For no-write-allocate, there will not be a line in the cache.		*Write-allocate vs Write no-allocate: When a write miss occurs in the cache, there may or may not be a line in the cache allocated to the block. For write-allocate, there will be a line in the cache for the written data. This policy is typically associated with write-back caches. For no-write-allocate, there will not be a line in the cache.
	*Fetch-on-write vs no-fetch-on-write: The fetch-on-write will cause the block of data to be fetched from a lower memory hierarchy if the write ~~misses~~. The policy fetches a block on every write miss. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line. Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.		*Fetch-on-write vs no-fetch-on-write: The fetch-on-write will cause the block of data to be fetched from a lower memory hierarchy if the write miss occurs. The policy fetches a block on every write miss. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line. Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.
	*Write-before-hit vs no-write-before-hit: The write-before-hit will write data to the cache before checking the cache tags for a match. In case of a miss, the policy will displace the block of data already in the cache.		*Write-before-hit vs no-write-before-hit: The write-before-hit will write data to the cache before checking the cache tags for a match. In case of a miss, the policy will displace the block of data already in the cache.



	==Combination Policies==		==Combination Policies==

Vgarg2: /* Cache Write Policies */

2015-03-03T00:20:35Z

Cache Write Policies

← Older revision		Revision as of 00:20, 3 March 2015
Line 141:		Line 141:
	=Cache Write Policies=		=Cache Write Policies=

	In section 6.2.3,<sup><span id="10body">[[#10foot\|[10]]]</span></sup> cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to L1 memory. ~~As review, write~~-through writes data to the cache and memory on a write. Write-back writes to cache first and to memory only when a flush is required.		In section 6.2.3,<sup><span id="10body">[[#10foot\|[10]]]</span></sup> cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to L1 memory. Write-through writes data to the cache and memory on a write request. Write-back writes to cache first and to memory only when a flush is required.
	The write miss policies covered in the text,<sup><span id="10body">[[#10foot\|[10]]]</span></sup> write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line.		The write miss policies covered in the text,<sup><span id="10body">[[#10foot\|[10]]]</span></sup> write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line.
	Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.		Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.

Vgarg2: /* Cache Hierarchy */

2015-03-03T00:18:29Z

Cache Hierarchy

← Older revision		Revision as of 00:18, 3 March 2015
Line 3:		Line 3:
	=Cache Hierarchy=		=Cache Hierarchy=
	[[Image: memchart.jpg\|thumbnail\|300px\|right\|Memory Hierarchy<sup><span id="9body">[[#9foot\|[9]]]</span></sup>]]		[[Image: memchart.jpg\|thumbnail\|300px\|right\|Memory Hierarchy<sup><span id="9body">[[#9foot\|[9]]]</span></sup>]]
	In a simple computer model, processor reads data and instructions from the memory and operates on the data. Operating frequency of CPU has increased faster than the speed of memory and memory interconnects. Over the years the increase in the Intel microprocessor's speed has been roughly 60% per year, while memory access times have improved by less than 10% per year. Also, multi-core architecture started putting more demand on memory bandwidth. This increases the latency in memory access and CPU will have to be idle for most of the time. Due to this, memory became a bottle neck in performance.		In a simple computer model, processor reads data and instructions from the memory and operates on the data. Operating frequency of CPU has increased faster than the speed of memory and memory interconnects. Over the years the increase in the Intel microprocessor's speed has been roughly 60% per year, while memory access times have improved by less than 10% per year. Also, multi-core architecture has started putting more demand on the memory bandwidth. This increases the latency in memory access and CPU will have to remain idle for most of the time. Due to this, memory has became a bottle neck in performance.

	To solve this problem, “cache” was invented. Cache is simply a temporary volatile storage space like primary memory but runs at the speed similar to core frequency. CPU can access data and instructions from cache in few clock cycles while accessing data from main memory can take more than 50 cycles. In early days of computing, cache was implemented as a stand alone chip outside the processor. In today’s processors, cache is implemented on same die as core.		To solve this problem, “cache” was invented. Cache is simply a temporary volatile storage space like primary memory but runs at the speed similar to core frequency. CPU can access data and instructions from cache in few clock cycles while accessing data from main memory can take more than 50 cycles. In early days of computing, cache was implemented as a stand alone chip outside the processor. In today’s processors, cache is implemented on the same die as the core.

	There can be multiple levels of caches, each cache subsequently away from the core and larger in size. L1 is closest to the CPU and as a result, fastest to ~~excess~~. Next to L1 is L2 cache and then L3. L1 cache is divided into instruction cache and data cache. This is better than having a combined larger cache as instruction cache being read-only is easy to implement while data cache is read-write.<br>		There can be multiple levels of caches, each cache level subsequently away from the core and larger in size. L1 is closest to the CPU and as a result, fastest to access. Next to L1 is L2 cache and then L3. L1 cache is divided into instruction cache and data cache. This is better than having a combined larger cache as instruction cache being read-only is easy to implement while data cache is read-write.<br>

	Cache management is impacted by three characteristics of modern processor architectures: multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.		Cache management is impacted by three characteristics of modern processor architectures: multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.

Vgarg2: /* Cache Hierarchy */

2015-03-03T00:14:01Z

Cache Hierarchy

← Older revision		Revision as of 00:14, 3 March 2015
Line 3:		Line 3:
	=Cache Hierarchy=		=Cache Hierarchy=
	[[Image: memchart.jpg\|thumbnail\|300px\|right\|Memory Hierarchy<sup><span id="9body">[[#9foot\|[9]]]</span></sup>]]		[[Image: memchart.jpg\|thumbnail\|300px\|right\|Memory Hierarchy<sup><span id="9body">[[#9foot\|[9]]]</span></sup>]]
	In a simple computer model, processor reads data and instructions from the memory and operates on the data. Operating frequency of CPU increased faster than the speed of memory and memory interconnects. Over the years the increase in the Intel microprocessor's speed has been roughly 60% per year, while memory access times have improved by less than 10% per year. Also, multi-core architecture started putting more demand on memory bandwidth. This increases the latency in memory access and CPU will have to be idle for most of the time. Due to this, memory became a bottle neck in performance.		In a simple computer model, processor reads data and instructions from the memory and operates on the data. Operating frequency of CPU has increased faster than the speed of memory and memory interconnects. Over the years the increase in the Intel microprocessor's speed has been roughly 60% per year, while memory access times have improved by less than 10% per year. Also, multi-core architecture started putting more demand on memory bandwidth. This increases the latency in memory access and CPU will have to be idle for most of the time. Due to this, memory became a bottle neck in performance.

	To solve this problem, “cache” was invented. Cache is simply a temporary volatile storage space like primary memory but runs at the speed similar to core frequency. CPU can access data and instructions from cache in few clock cycles while accessing data from main memory can take more than 50 cycles. In early days of computing, cache was implemented as a stand alone chip outside the processor. In today’s processors, cache is implemented on same die as core.		To solve this problem, “cache” was invented. Cache is simply a temporary volatile storage space like primary memory but runs at the speed similar to core frequency. CPU can access data and instructions from cache in few clock cycles while accessing data from main memory can take more than 50 cycles. In early days of computing, cache was implemented as a stand alone chip outside the processor. In today’s processors, cache is implemented on same die as core.

Vgarg2: /* Cache Replacement Policies */

2015-03-02T23:51:11Z

Cache Replacement Policies

← Older revision		Revision as of 23:51, 2 March 2015
Line 201:		Line 201:
	*[https://en.wikipedia.org/wiki/Least_frequently_used Least Frequently Used(LFU)] policy: In LFU policy, counter is maintained on how often the cache line is referenced and evicts the block from least frequently used cache line.		*[https://en.wikipedia.org/wiki/Least_frequently_used Least Frequently Used(LFU)] policy: In LFU policy, counter is maintained on how often the cache line is referenced and evicts the block from least frequently used cache line.

	There are other replacement policies such as Most Recently Used(MRU), Random Replacement(RR), [https://en.wikipedia.org/wiki/Pseudo-LRU Psuedo-LRU(PLRU)], Segmented LRU(SLRU) and so on. Each type of replacement policy works best for specific type of access patterns or the workload.<sup><span id="27body">[[#27foot\|[27]]]</span></sup> Most latest Intel i7 processors use Pseudo-LRU policy for replacement owing to its lower latency and low power consumption.		There are other replacement policies such as Most Recently Used(MRU), Random Replacement(RR), [https://en.wikipedia.org/wiki/Pseudo-LRU Psuedo-LRU(PLRU)], Segmented LRU(SLRU) and so on. Each type of replacement policy works best for specific type of access patterns or the workload.<sup><span id="27body">[[#27foot\|[27]]]</span></sup> Most latest Intel i7 processors use Pseudo-LRU policy for replacement owing to its lower latency and low power consumption.

	=Prefetching=		=Prefetching=

Vgarg2: /* Cache Replacement Policies */

2015-03-02T23:50:49Z

Cache Replacement Policies

← Older revision		Revision as of 23:50, 2 March 2015
Line 201:		Line 201:
	*[https://en.wikipedia.org/wiki/Least_frequently_used Least Frequently Used(LFU)] policy: In LFU policy, counter is maintained on how often the cache line is referenced and evicts the block from least frequently used cache line.		*[https://en.wikipedia.org/wiki/Least_frequently_used Least Frequently Used(LFU)] policy: In LFU policy, counter is maintained on how often the cache line is referenced and evicts the block from least frequently used cache line.

	There are other replacement policies such as Most Recently Used(MRU), Random Replacement(RR), [https://en.wikipedia.org/wiki/Pseudo-LRU Psuedo-LRU(PLRU)], Segmented LRU(SLRU) and so on. Each type of replacement policy works best for specific type of access patterns or the workload.<sup><span id="27body">[[#27foot\|[27]]]</span></sup>		There are other replacement policies such as Most Recently Used(MRU), Random Replacement(RR), [https://en.wikipedia.org/wiki/Pseudo-LRU Psuedo-LRU(PLRU)], Segmented LRU(SLRU) and so on. Each type of replacement policy works best for specific type of access patterns or the workload.<sup><span id="27body">[[#27foot\|[27]]]</span></sup> Most latest Intel i7 processors use Pseudo-LRU policy for replacement owing to its lower latency and low power consumption.

	=Prefetching=		=Prefetching=

← Older revision		Revision as of 01:18, 3 March 2015
Line 268:		Line 268:
	==Software vs. Hardware solutions==		==Software vs. Hardware solutions==

	Both software and hardware solutions exists for the cache coherence. Hardware solutions are most widely used than software solutions. Software scheme require support from cache or runtime ~~system~~ and some cases also require hardware assistance.		Both software and hardware solutions exists for the cache coherence. Hardware solutions are most widely used than software solutions. Software scheme require support from cache or runtime systems and in some cases also require hardware assistance.

	<ref>[ftp://ftp.cs.wisc.edu/markhill/Papers/isca91_coherence.pdf Comparison of Hardware and Software Cache Coherence Schemes] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon</ref>Recent studies have shown that software schemes are comparable to the hardware solutions. The only cases for which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts or when there are many writes in the program that are not executed at run-time. For relatively well structured and deterministic programs, on the other hand, software schemes perform significantly in the same range as the hardware schemes.		<ref>[ftp://ftp.cs.wisc.edu/markhill/Papers/isca91_coherence.pdf Comparison of Hardware and Software Cache Coherence Schemes] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon</ref>Recent studies have shown that software schemes are comparable to the hardware solutions. The only cases for which software schemes perform significantly worse than hardware schemes are when there is greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts or when there are many writes in the program that are not executed at run-time. For relatively well structured and deterministic programs, on the other hand, software schemes perform significantly in the same range as the hardware schemes.

	==Cache Coherence Schemes – Fetch and Replacements==		==Cache Coherence Schemes – Fetch and Replacements==
Line 294:		Line 294:

	===Directory Based Cache Coherence Schemes===		===Directory Based Cache Coherence Schemes===
	#In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the ~~processor~~ must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.		#In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processors must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.
	#Fetch and Replacement Scenarios:<ref>[http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf Cache Coherence Techniques] Silvia Lametti</ref>		#Fetch and Replacement Scenarios:<ref>[http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf Cache Coherence Techniques] Silvia Lametti</ref>
	##Block in Uncached State: The block required by a processor P1 is not in the cache. When a processor P1 tries to read an uncached block X, read miss occurs.		##Block in Uncached State: The block required by a processor P1 is not in the cache. When a processor P1 tries to read an uncached block X, read miss occurs.
Line 301:		Line 301:
	##Block is Shared: The block requested by a Processor P1 is in shared state.		##Block is Shared: The block requested by a Processor P1 is in shared state.
	##*Read Miss: Requesting Processor P1 is sent the data and P1 is added to the set of processors sharing the data.		##*Read Miss: Requesting Processor P1 is sent the data and P1 is added to the set of processors sharing the data.
	##*Write Miss: Requesting Processor P1 is sent the value. All processors sharing the data are sent ~~and~~ invalidate message. The block is made exclusive for the Processor P1.		##*Write Miss: Requesting Processor P1 is sent the value. All processors sharing the data are sent an invalidate message. The block is made exclusive for the Processor P1.
	##Block is Exclusive: The current value of the block (say X) is held in the cache of the processor (the owner) P1 which currently owns the block exclusively.		##Block is Exclusive: The current value of the block (say X) is held in the cache of the processor (the owner) P1 which currently owns the block exclusively.
	##*Read Miss: When a processor P2 raises fetch request for the block X, P1 is sent ~~the~~ notification that a fetch request has been posted for a block currently held by P1. P1 sends the data back to the directory, where it is written to memory and sent back to requesting processor P2. P2 will now be added to the set of processor sharing the block X along with P1.		##*Read Miss: When a processor P2 raises a fetch request for the block X, P1 is sent a notification that a fetch request has been posted for a block currently held by P1. P1 sends the data back to the directory, where it is written to memory and sent back to the requesting processor P2. P2 will now be added to the set of processor sharing the block X along with P1.
	##*Data Write-back: The owner processor P1 is replacing the block and hence must write it back. This cause the copy of block X in memory to be up to date. The block will now be uncached and the set maintaining the list of shared processors will be emptied		##*Data Write-back: The owner processor P1 is replacing the block and hence must write it back. This cause the copy of block X in memory to be up to date. The block will now be uncached and the set maintaining the list of shared processors will be emptied
	##*Write Miss: A processor P2 makes write request to the block X which is currently owned by P1. P1 will be notified of this, which causes P1 to update the memory with its value of block X. The updated block X is now sent to P2 and it is made exclusive for P2.		##*Write Miss: A processor P2 makes write request to the block X which is currently owned by P1. P1 will be notified of this, which causes P1 to update the memory with its value of block X. The updated block X is now sent to P2 and it is made exclusive for P2.

← Older revision		Revision as of 00:44, 3 March 2015
Line 202:		Line 202:

	=Prefetching=		=Prefetching=
	[http://en.wikipedia.org/wiki/Instruction_prefetch Prefetching] is a technique that can be used in addition to write-hit and write-miss policies to ~~improved~~ the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. Cache is very efficient in terms on access time once ~~that~~ data or instructions are in the cache. But when a process tries to access something that is not already in the cache, a cache miss occurs and those pages need to be brought into the cache from memory. Generally cache ~~miss~~ are expensive to the performance as the processor has to wait for ~~that~~ data (In parallel processing, a process can execute other tasks while it is waiting on data, but there will be some overhead for this). Prefetching is a technique in which data is brought into cache before the program needs it. In other words, it is a way to reduce cache misses. Prefetching uses some type of prediction to mechanism to anticipate the next data that will be needed and brings them into cache. It is not guaranteed that the prefetched data will be used. ~~Goal~~ here is to reduce cache misses to improve overall performance.<br><br>		[http://en.wikipedia.org/wiki/Instruction_prefetch Prefetching] is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. Cache is very efficient in terms on access time once the data or instructions are in the cache. But when a process tries to access something that is not already in the cache, a cache miss occurs and those pages need to be brought into the cache from memory. Generally cache misses are expensive to the performance as the processor has to wait for the data (In parallel processing, a process can execute other tasks while it is waiting on data, but there will be some overhead for this). Prefetching is a technique in which data is brought into cache before the program needs it. In other words, it is a way to reduce cache misses. Prefetching uses some type of prediction mechanism to anticipate the next data that will be needed and brings them into cache. It is not guaranteed that the prefetched data will be used though. The goal here is to reduce the cache misses to improve the overall performance.<br><br>
	Some architectures have instructions to prefetch data into cache. Programmers and compliers can insert this prefect instruction in the code. This is known as [http://www.ece.lsu.edu/tca/papers/vanderwiel-00.pdf software prefetching]. Software prefetching can be of further two types: binding and non-binding. Binding pre-fetches into a register using a regular load while non-binding prefetches into cache using a special instruction. In [http://www.ece.lsu.edu/tca/papers/vanderwiel-00.pdf hardware prefetching], processor observers the system behavior and issues requests for prefetching. It can be tuned to system implementation and has no code portability issues. Next Line prefetchers, [http://www.bergs.com/stefan/general/general_slides/sld011.htm Stride Prefetchers] and [http://www.owlnet.rice.edu/~elec525/projects/SBpresentation.pdf Stream Prefetchers] are some of the modern techniques to implement hardware based prefetching. Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch. Graphics Processing Units benefit from prefetching due to spatial locality property of the data.<sup><span id="8body">[[#8foot\|[8]]]</span></sup>. The following image shows how a prefetch is organised in a memory system.<sup><span id="26body">[[#26foot\|[26]]]</span></sup>		Some architectures have instructions to prefetch data into cache. Programmers and compliers can insert this prefect instruction in the code. This is known as [http://www.ece.lsu.edu/tca/papers/vanderwiel-00.pdf software prefetching]. Software prefetching can be of further two types: binding and non-binding. Binding pre-fetches into a register using a regular load while non-binding prefetches into cache using a special instruction. In [http://www.ece.lsu.edu/tca/papers/vanderwiel-00.pdf hardware prefetching], processor observers the system behavior and issues requests for prefetching. It can be tuned to system implementation and has no code portability issues. Next Line prefetchers, [http://www.bergs.com/stefan/general/general_slides/sld011.htm Stride Prefetchers] and [http://www.owlnet.rice.edu/~elec525/projects/SBpresentation.pdf Stream Prefetchers] are some of the modern techniques to implement hardware based prefetching. Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch. Graphics Processing Units benefit from prefetching due to spatial locality property of the data.<sup><span id="8body">[[#8foot\|[8]]]</span></sup>. The following image shows how a prefetch is organised in a memory system.<sup><span id="26body">[[#26foot\|[26]]]</span></sup>
	[[Image:Fetch_1.JPG\|center\|alt=How a Prefetch fits in memory system]]		[[Image:Fetch_1.JPG\|center\|alt=How a Prefetch fits in memory system]]
Line 222:		Line 222:
	# Accuracy is defined as fraction of prefetches that are useful.		# Accuracy is defined as fraction of prefetches that are useful.
	# Timeliness measures how early the prefetches arrive.		# Timeliness measures how early the prefetches arrive.
	Ideally, a system should have high coverage, high accuracy and optimum timeliness. Realistically, aggressive prefetches can increase coverage but decrease accuracy and vice versa. Also, if prefetching is done too early, the fetched data may have to be evicted without being used and if done too late, it can cause cache miss.		Ideally, a system should have high coverage, high accuracy and optimum timeliness. Realistically, aggressive prefetches can increase coverage but decrease accuracy and vice versa. Also, if prefetching is done too early, the fetched data may have to be evicted without being used and if done too late, it can cause a cache miss.

	==Stream Buffer Prefetching==		==Stream Buffer Prefetching==
	This is a technique for prefetching which uses a FIFO buffer in which each entry is a cacheline and has address (or tag) and a available bit. System prefetches a stream of sequential data into a stream buffer and multiple stream buffers can be used to prefetch multiple streams in parallel. On a cache access, head entries of stream buffers are checked for match along with cache check. If the block is not found in cache but found at the head of stream buffer, it is moved to cache and next entry in the buffer becomes the head. If the block is not found in cache or as head of buffer, data is brought from memory into cache and the subsequent address as assigned to a new stream buffer. Only the heads of the stream buffers are checked during cache access and not the whole buffer. Checking all the entries in all the buffers will increase hardware complexity.		This is a technique for prefetching which uses a FIFO buffer in which each entry is a cacheline and has a address (or tag) and a available bit. System prefetches a stream of sequential data into a stream buffer and multiple stream buffers can be used to prefetch multiple streams in parallel. On a cache access, head entries of stream buffers are checked for match along with cache check. If the block is not found in cache but found at the head of stream buffer, it is moved to cache and next entry in the buffer becomes the head. If the block is not found in cache or as head of buffer, data is brought from memory into cache and the subsequent address is assigned to a new stream buffer. Only the heads of the stream buffers are checked during cache access and not the whole buffer. Checking all the entries in all the buffers will increase hardware complexity.
	<br>		<br>
	<br>		<br>
Line 235:		Line 235:
	On a uniprocessor system, prefetching is definitely helpful to improve performance. On a multiprocessor system prefetching is useful, but there are tighter constrains in implementing because of the fact that data will be shared between different processors or cores. In message passing parallel model, each parallel thread has its own memory space and prefetching can be implemented in same way as for uniprocessor system.<br><br>		On a uniprocessor system, prefetching is definitely helpful to improve performance. On a multiprocessor system prefetching is useful, but there are tighter constrains in implementing because of the fact that data will be shared between different processors or cores. In message passing parallel model, each parallel thread has its own memory space and prefetching can be implemented in same way as for uniprocessor system.<br><br>
	In shared memory parallel programming, multiple threads that run on different processors share common memory space. If multiple processors use common cache, then prefetching implementation is similar to uniprocessor system. Difficulties arise when each core has its own cache. Some of the case-scenarios that can occur are:		In shared memory parallel programming, multiple threads that run on different processors share common memory space. If multiple processors use common cache, then prefetching implementation is similar to uniprocessor system. Difficulties arise when each core has its own cache. Some of the case-scenarios that can occur are:
	# Processor P1 has prefetched some data D1 into its stream buffer but is not used it. At the same time processor P2 reads data D1 into its cache and modifies it and one of the cache coherency protocols would be used to inform processor P1 about this change. D1 is not in P1’s cache so it ~~many~~ simply ignore this. Now, when P1 ~~ties~~ to read D1, it will get the stale data from its stream buffer. One way to prevent this is by improving stream buffers so that they can modify their data just like a cache. This adds complexity to the architecture and increases cost.<sup><span id="6body">[[#6foot\|[6]]]</span></sup>		# Processor P1 has prefetched some data D1 into its stream buffer but is not used it. At the same time processor P2 reads data D1 into its cache and modifies it and one of the cache coherency protocols would be used to inform processor P1 about this change. D1 is not in P1’s cache so it may simply ignore this. Now, when P1 tries to read D1, it will get the stale data from its stream buffer. One way to prevent this is by improving stream buffers so that they can modify their data just like a cache. This adds complexity to the architecture and increases cost.<sup><span id="6body">[[#6foot\|[6]]]</span></sup>
	# Prefetching mechanism can fetch data D1, D2, D3,….D10 into P1’s buffer. Now due to parallel processing, P1 only needs to operate on D1 to D5 while P2 will operate on remaining data. Some bandwidth was wasted in fetching D6 to D10 into P1 even though it did not use it. There is a trade off to be made here, if prefetching is very conservative then it will lead to miss and if not then it will waste bandwidth.		# Prefetching mechanism can fetch data D1, D2, D3,….D10 into P1’s buffer. Now due to parallel processing, P1 only needs to operate on D1 to D5 while P2 will operate on remaining data. Some bandwidth was wasted in fetching D6 to D10 into P1 even though it did not use it. There is a trade off to be made here, if prefetching is very conservative then it will lead to a miss and if not then it will waste bandwidth.
	# In a multiprocessor system, if threads are not bound to a core, ~~Operation~~ system can rebalance the treads of different cores. This will require the prefetched buffers to be trashed.		# In a multiprocessor system, if threads are not bound to a core, Operating system can rebalance the treads of different cores. This will require the prefetched buffers to be trashed.

	Following are two examples of processors that support prefetching directly in the chip design:		Following are two examples of processors that support prefetching directly in the chip design:
Line 250:		Line 250:
	===Prefetching in AMD<sup><span id="12body">[[#12foot\|[12]]]</span></sup>===		===Prefetching in AMD<sup><span id="12body">[[#12foot\|[12]]]</span></sup>===

	The AMD Family 10h processors ~~use~~ a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache. (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)		The AMD Family 10h processors uses a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache. (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)

	The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though. When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory. For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected. Subsequent detection of misses of sequential blocks may only prefetch a single block. This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.		The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though. When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from the memory. For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected. Subsequent detection of misses of sequential blocks may only prefetch a single block. This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.

	AMD contends that this is beneficial when processing large data sets which often process sequential data and ~~process~~ all the data in the stream. In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU. AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing.		AMD contends that this is beneficial when processing large data sets which often process sequential data and processes all the data in the stream. In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU. AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing.

	The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream. In this case, the unit-stride is increased to assure that the prefetcher is fetching at a rate sufficient enough to continuously provide data to the processor.		The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream. In this case, the unit-stride is increased to assure that the prefetcher is fetching at a rate sufficient enough to continuously provide data to the processor.