Expertiza_Wiki - User contributions [en]

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:51:18Z

Dtiwari2: /* References */

== <tt>'''Introduction'''</tt> ==

In a multiprocessor computing system with a snoopy coherence protocol, overall performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by the communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses; see [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication, however, are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block, after which the processor ''writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because some word in the block, other than the one being read, was written by another processor.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, the shared copy of that block in another processor's cache is invalidated. If the latter processor subsequently reads the same word in the same block, the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different, so that the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. Consequently, as the cache block size is increased, false sharing misses tend to go up as well, because more unrelated words fall in the same block. Although large block sizes exploit locality and decrease the effective memory access time, the full cache block (all words in the block) is not necessarily useful to a particular processor; usually only part of it is relevant. In general, [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf previous work] has shown that the miss rate of the data cache in multiprocessors changes less predictably than in uniprocessors with increasing cache block size.
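
The distinction above can be illustrated with a toy trace classifier. This is only a minimal sketch under simplifying assumptions that are not part of the cited studies: caches are infinite (no capacity or conflict misses), the protocol is a plain write-invalidate scheme, and a coherence miss counts as true sharing only if the referenced word itself was written remotely since the copy was invalidated. The function name <tt>classify_misses</tt> and the trace format are hypothetical.

```python
from collections import Counter


def classify_misses(trace, block_words):
    """Classify each access in a trace under a toy write-invalidate model.

    trace: list of (processor, op, word_address), with op "R" or "W".
    block_words: number of words per cache block.
    Returns a Counter with keys "hit", "cold", "true_sharing", "false_sharing".
    """
    valid = {}           # (processor, block) -> True if the copy is valid
    remote_writes = {}   # (processor, block) -> words written remotely since invalidation
    counts = Counter()
    processors = {proc for proc, _, _ in trace}

    for proc, op, word in trace:
        block = word // block_words
        key = (proc, block)
        if valid.get(key):
            counts["hit"] += 1
        elif key not in valid:
            counts["cold"] += 1            # block was never cached by this processor
        elif word in remote_writes[key]:
            counts["true_sharing"] += 1    # the referenced word carries a new value
        else:
            counts["false_sharing"] += 1   # only other words in the block changed

        # After the access, the local copy is valid and up to date.
        valid[key] = True
        remote_writes[key] = set()

        if op == "W":
            # A write invalidates every other cached copy of the block.
            for other in processors:
                other_key = (other, block)
                if other != proc and other_key in valid:
                    valid[other_key] = False
                    remote_writes[other_key].add(word)
    return counts


# P0 repeatedly writes word 0 while P1 repeatedly reads word 1.
different_words = [(0, "W", 0), (1, "R", 1)] * 4
# P0 repeatedly writes word 0 while P1 repeatedly reads the same word.
same_word = [(0, "W", 0), (1, "R", 0)] * 4
```

In this model, <tt>different_words</tt> produces no coherence misses with one-word blocks but three false sharing misses once both words fall in the same two-word block, whereas <tt>same_word</tt> produces three true sharing misses regardless of the block size.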

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2] ([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel applications using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation in <tt>O(n log n)</tt> time. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

Not only do false sharing misses increase memory access latency, they also generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic. Yet between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the popular and widely accepted techniques:

'''I. Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested ([http://citeseer.ist.psu.edu/160021.html reference]) to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and in the cache block have a greatest common divisor (GCD) larger than one. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
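
The effect of the Expand Record and Align Record optimizations can be sketched with a few layout helpers. This is only an illustration, not the cited authors' implementation: sizes are counted in words, the array is assumed to be contiguous, padding simply rounds each record up to a whole number of blocks (one possible padding choice), and all function names are hypothetical.

```python
import math


def expand_record(record_words, block_words):
    """Expand Record: pad a record up to the next multiple of the block size."""
    return -(-record_words // block_words) * block_words  # ceiling division


def shared_blocks(n_records, record_words, block_words, offset=0):
    """Count cache blocks that hold words from more than one record."""
    owners = {}  # block index -> set of record indices with words in that block
    for i in range(n_records):
        start = offset + i * record_words
        for word in range(start, start + record_words):
            owners.setdefault(word // block_words, set()).add(i)
    return sum(1 for records in owners.values() if len(records) > 1)


def candidate_offsets(record_words, block_words):
    """Align Record: offsets from a block boundary that are 0 or a multiple of the GCD."""
    step = math.gcd(record_words, block_words)
    return list(range(0, block_words, step))


def average_blocks_spanned(n_records, record_words, block_words, offset):
    """Average number of cache blocks touched by one record of the array."""
    total = 0
    for i in range(n_records):
        first = offset + i * record_words
        last = first + record_words - 1
        total += last // block_words - first // block_words + 1
    return total / n_records
```

For example, with 6-word records and 4-word blocks, four unpadded records share two blocks between neighbours, while padding each record to <tt>expand_record(6, 4)</tt> words (8) leaves no block shared. For the same sizes the GCD is 2, so the candidate offsets are 0 and 2; both give an average span of 2 blocks per record, whereas an offset of 1 raises the average to 2.5.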

'''II. Dynamic Cache sub-block Design to Reduce False Sharing:''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference] In this method, false sharing is minimized by dynamically locating the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are determined dynamically, with the objective of maximizing the size of a truly shared sub-block. The size settles to the value most suitable for the application being run, which also minimizes the invalidation misses due to false partitioning of truly shared blocks. Larger blocks exploit locality, while coherence is maintained on sub-blocks, which minimizes bus traffic due to shared misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization that supports the dynamic cache sub-block design.

[[Image:False03.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3]: New Architecture for the Dynamic Sub-block Coherence Scheme'''

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme (see [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4]).

[[Image:False04.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4]: The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs. block size.'''

More about the benchmarks used:
* [http://www-flash.stanford.edu/apps/SPLASH/ MP3D]
* Floyd’s algorithm finds the least-expensive paths between all pairs of vertices in a graph. It does this by operating on a matrix representing the costs of edges between vertices. More [http://www-unix.mcs.anl.gov/dbpp/text/node35.html here].
* [http://www-flash.stanford.edu/apps/SPLASH/ Water]
* The Jacobi method is an iterative algorithm in linear algebra for solving a system of linear equations whose matrix is diagonally dominant, that is, one in which the diagonal element dominates the other entries of its row in absolute value. More [http://en.wikipedia.org/wiki/Jacobi_method here].

== <tt>'''Techniques to reduce true sharing misses'''</tt> ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors. High levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

In fact, true sharing misses are required for the correct execution of the program. The techniques that target true sharing are therefore mostly concerned with reducing the latency and bus traffic caused by each miss.

* One proposed technique is to give each processor a private L1 cache and to share an L2 cache among all the processors (also called an SPS2 system). [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5] shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5]: A possible SPS2 cache architecture to reduce true sharing latency.'''

The shared L2 is a multi-banked, multi-port cache that can be accessed by all the processors directly over the bus. Data in the private L1 and L2 are exclusive, but the private L1 and shared L2 could be inclusive. Unlike a unified L2 cache structure, the SPS2 system, with its split private and shared L2 caches, can be designed flexibly and individually according to demand. SPS2 reduces access latency and contention between shared data and private data. It offers a low L2 hit latency because most of the private data should be found in the local L2. The shared data is placed in the shared L2, whose banks collectively provide high storage capacity and help reduce off-chip accesses.

* Sub-blocking in the cache can also help reduce the bus communication caused by true sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6] shows sub-blocking in a cache.

[[Image:True2.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6]: Sub-blocking in a cache'''

A valid bit is associated with each sub-block in the cache line. A sub-block might be as small as the word that a processor reads or writes. If a true sharing miss occurs, only the specific word written by the other processor needs to be sent across the bus, not the whole cache line.
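
The per-sub-block valid bits can be modelled in a few lines. This is a minimal sketch rather than a description of any particular hardware: the class name <tt>SubBlockedLine</tt> and its methods are hypothetical, every sub-block is assumed to start out valid, and bus traffic is counted in sub-blocks.

```python
class SubBlockedLine:
    """Toy model of one cache line with a valid bit per sub-block."""

    def __init__(self, n_sub_blocks):
        # Assumption: the whole line has already been fetched, so all bits start valid.
        self.valid = [True] * n_sub_blocks

    def remote_write(self, sub_block):
        """Another processor wrote this sub-block: invalidate only that sub-block."""
        self.valid[sub_block] = False

    def read(self, sub_block):
        """Read a sub-block; return how many sub-blocks cross the bus (0 on a hit)."""
        if self.valid[sub_block]:
            return 0
        # Miss on an invalidated sub-block: fetch just that sub-block, not the whole line.
        self.valid[sub_block] = True
        return 1


line = SubBlockedLine(4)
line.remote_write(0)           # another processor writes sub-block 0
hit_traffic = line.read(1)     # a different sub-block is still valid
miss_traffic = line.read(0)    # the written sub-block must be re-fetched
```

In this model the read of sub-block 1 is a hit even though another processor has written sub-block 0, and the later read of sub-block 0 transfers one sub-block instead of all four.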

==<tt>'''Conclusion'''</tt>==
The performance of parallel applications is influenced by their sharing behavior and varies significantly with cache block size. A cache block can be shared by several processors either on a per-processor basis or at a finer granularity.

Data layout schemes and dynamic sub-blocking of the cache line have proved successful in reducing false sharing overheads. True sharing is inherent to a parallel program and requires the processors involved to explicitly synchronize with each other to ensure program correctness. It cannot be completely eliminated from parallel programs, but methods have been proposed to reduce the overhead of the inter-processor communication it causes.

== <tt>'''References'''</tt> ==
1. [http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J.P. Singh with Anoop Gupta]

2. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy and David A. Patterson]

3. [http://www.cs.princeton.edu/~jps/ Parallel Computer Architecture lecture notes by Jaswinder Pal Singh]

4. [http://www.eecs.berkeley.edu/Pubs/TechRpts/1988/6056.html S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.]

5. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 Analysis of shared memory misses and reference patterns by Rothman et al.]

6. [http://citeseer.ist.psu.edu/160021.html Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations by Tor E. Jeremiassen and Susan J. Eggers]

7. [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf J. Lee and Y Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.]

8. [http://citeseer.ist.psu.edu/kadiyala95dynamic.html Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.]

9. [http://www.virtutech.com/about/research/pdf/zhao-ICIS-L2cache.pdf Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP]

10. http://suif.stanford.edu/papers/anderson95/node2.html

11. [http://www.intel.com/cd/ids/developer/asmo-na/eng/43813.htm False sharing in threaded programming environment and potential solutions]

12. [http://docs.hp.com/en/B3909-90003/ch13s02.html HP Article]

13. [http://www.usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/thomas/thomas_html/node10.html Effect of false sharing]

14. Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:50:57Z

Dtiwari2: /* Conclusion */

== <tt>'''Introduction'''</tt> ==

In multiprocessor computing system with snoopy coherence protocol, the overall performance is a contributed by two factors
* Uniprocessor cache miss traffic
* The traffic caused by the communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]).However, the misses arising from interprocessor communication are named as coherence misses and can be divided into two different types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and'' writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because a particular word in the block, other than one being read, is getting written.

The effect of coherence misses become significant for tightly coupled applications that share significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
Id a one processor writes a word in particular block then other processor invalidates shared copy of the same block, subsequently later process intends to read the same word in the same block, then the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different and the invalidation does not cause a new value to be communicated, but only causes an extra cache miss, then it is a false sharing miss. Evidently, if the cache block size is increased, false sharing misses would also go up. Though, large block sizes exploit locality and decrease the effective memory access time, full cache block (all words in the block) is not necessarily useful to a particular processor rather mostly only part of it is relevant. In general, [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf previous works]have shown the miss rate of the data cache in multiprocessors changes less predictably than in uniprocessors with increasing cache block size.

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2]([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel application using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation having order <tt>O(n log n) </tt>. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

Not only do false misses increase the memory accesses latency, but also they generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic.Generally, between two consecutive misses on a given block, a processor usually references a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. Following are some of the popular and accepted techniques to combat false sharing issue:

'''I.Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested [http://citeseer.ist.psu.edu/160021.html reference]to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and in the cache block have a greater common divisor (GCD) larger than one.The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.

'''II. Dynamic Cache sub-block Design to Reduce False Sharing ''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference]In this method false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) which are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are dynamically determined with the objective of maximizing the size of a truly shared sub-block which will settle to a value which is most suitable for the application being run and also minimize the invalidation misses due to false partitioning of true shared blocks. Larger blocks exploit locality while coherence is maintained on sub-blocks which minimize bus traffic due to shared misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization to support dynamic cache sub-block design.

[[Image:False03.JPG]]

''' [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] New Architecture for the Dynamic Sub-block Coherence Scheme'''

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4].

[[Image:False04.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4] The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.'''

More about benchmarks used :
*[http://www-flash.stanford.edu/apps/SPLASH/ MP3D],
*Floyd’s algorithm is designed to find the least-expensive paths between all the vertices in a graph. It does this by operating on a matrix
representing the costs of edges between vertices. More [http://www-unix.mcs.anl.gov/dbpp/text/node35.html here]
*[http://www-flash.stanford.edu/apps/SPLASH/ Water]
*The Jacobi method is an algorithm in linear algebra for determining the solutions of a system of linear equations with largest absolute values in each row and column dominated by the diagonal element. More [http://en.wikipedia.org/wiki/Jacobi_method here].

== <tt>'''Techniques to reduce true sharing misses'''</tt> ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing, programs with high temporal locality tend to have less true sharing. The amount of true sharing in the program is a critical factor for performance on multiprocessors. High levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

In fact, true misses are required for the correct execution of the program. The techniques used for true sharing is mostly involved with reducing the latency and bus traffic caused by miss.

* One of the proposed technique is to have a private L1 cache and shared L2 cache among all the processors (also called SPS2 system) . [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5] shows one of the possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5]: A possible SPS2 cache architecture to reduce true sharing latency.'''

Shared L2 is a multi-banked multi-port cache that could be accessed by all the processors directly over the bus. Data in private L1 and L2 are exclusive, but private L1 and shared L2 could be inclusive. Unlike the unified L2 cache structure, the SPS2 system with its split private and shared L2 caches can be flexibly and individually designed according to demand. SPS2 reduces access latency and contention between shared data and private data. It imposes a low L2 hit latency because most of the private data should be found in the local L2. So the shared data will be placed in shared L2 which collectively provide high storage capacity to help reduce off-chip access.

*Sub-blocking in the cache can also help in reducing the bus communication because of true sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6] shows the sub-blocking in a cache.

[[Image:True2.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6]: Sub-blocking in a cache'''

A valid bit is associated with each sub-block in the cache line. The sub-block might be equivalent to a word process writes or reads. Now if a true sharing miss occurs then only the specific word written by the processor in the cache line needs to send across the bus (not the whole cache line).

==<tt>'''Conclusion'''</tt>==
The performance of parallel applications is influenced by its sharing behavior and varies significantly with cache block size. A cache block can
be shared by several processors either on a per-processor basis or be finely shared by several processors.

Data layout schemes and dynamic sub-blocking of the cache line have proved successful in reducing the false sharing overheads over the years. True
sharing is something which is inherent to the parallel program and requires the processors involved to explicitly synchronize with each other
to ensure program correctness. True sharing can’t be completely eliminated from the parallel programs but methods have been proposed to reduce
overhead because of true sharing inter-processor communication.

== References ==
1.[http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J.P. Singh with Anoop Gupta ]

2.[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy , David A. Patterson]

3.[http://www.cs.princeton.edu/~jps/ Parallel Computer Architecture Lecture notes By Jaswinder Pal Singh]

4. [http://www.eecs.berkeley.edu/Pubs/TechRpts/1988/6056.html S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf Architectural Support for Programming Lung. and Operating Syst., Apr. 1989, pp.257-270.]

5. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 Analysis of shared memory misses and reference patterns by Rothman et. al.]

6. [http://citeseer.ist.psu.edu/160021.html Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations by Tor E. Jeremiassen and Susan J. Eggers]

7. [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf J. Lee and Y Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.]

8. [http://citeseer.ist.psu.edu/kadiyala95dynamic.html Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.]

9.[http://www.virtutech.com/about/research/pdf/zhao-ICIS-L2cache.pdf Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP]

10. http://suif.stanford.edu/papers/anderson95/node2.html

11. [http://www.intel.com/cd/ids/developer/asmo-na/eng/43813.htm False sharing in threaded programming environment and potential solutions]

12.[http://docs.hp.com/en/B3909-90003/ch13s02.html HP Article]

13.[http://www.usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/thomas/thomas_html/node10.html Effect of false sharing]

14. Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processang, volume II(Software), pages 266-270.

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:50:06Z

Dtiwari2: /* <tt>'''Techniques to reduce true sharing misses'''</tt> */

== <tt>'''Introduction'''</tt> ==

In multiprocessor computing system with snoopy coherence protocol, the overall performance is a contributed by two factors
* Uniprocessor cache miss traffic
* The traffic caused by the communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]).However, the misses arising from interprocessor communication are named as coherence misses and can be divided into two different types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and'' writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because a particular word in the block, other than one being read, is getting written.

The effect of coherence misses become significant for tightly coupled applications that share significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
Id a one processor writes a word in particular block then other processor invalidates shared copy of the same block, subsequently later process intends to read the same word in the same block, then the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different and the invalidation does not cause a new value to be communicated, but only causes an extra cache miss, then it is a false sharing miss. Evidently, if the cache block size is increased, false sharing misses would also go up. Though, large block sizes exploit locality and decrease the effective memory access time, full cache block (all words in the block) is not necessarily useful to a particular processor rather mostly only part of it is relevant. In general, [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf previous works]have shown the miss rate of the data cache in multiprocessors changes less predictably than in uniprocessors with increasing cache block size.

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2] ([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel applications using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation in <tt>O(n log n)</tt> time. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

Not only do false sharing misses increase memory access latency, they also generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic. Yet between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the popular and accepted techniques for combating false sharing:

'''I. Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested ([http://citeseer.ist.psu.edu/160021.html reference]) to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than one. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
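
The effect of the Expand Record and Align Record transformations can be checked with simple address arithmetic. The Python sketch below is a hedged illustration rather than the method of the cited papers. It assumes word-granular addresses and records stored back to back in an array, and all function names (<tt>blocks_touched</tt>, <tt>shared_blocks</tt>, <tt>padded_stride</tt>, <tt>average_span</tt>) as well as the chosen sizes are hypothetical.

```python
from math import gcd


def blocks_touched(start_word, length, block_words):
    """Set of cache block numbers covered by words [start_word, start_word + length)."""
    first = start_word // block_words
    last = (start_word + length - 1) // block_words
    return set(range(first, last + 1))


def shared_blocks(record_words, stride_words, count, block_words):
    """Number of cache blocks that hold words of more than one record."""
    owners = {}
    for i in range(count):
        for b in blocks_touched(i * stride_words, record_words, block_words):
            owners.setdefault(b, set()).add(i)
    return sum(1 for records in owners.values() if len(records) > 1)


def padded_stride(record_words, block_words):
    """Expand Record: round the record size up to a whole number of blocks."""
    return -(-record_words // block_words) * block_words


def average_span(record_words, count, block_words, offset):
    """Average number of blocks a record spans when the array starts at offset."""
    total = sum(
        len(blocks_touched(offset + i * record_words, record_words, block_words))
        for i in range(count)
    )
    return total / count


BLOCK = 4  # assumed block size in words

# Expand Record: 3-word records packed tightly versus padded to one block each.
packed = shared_blocks(3, 3, 8, BLOCK)
padded = shared_blocks(3, padded_stride(3, BLOCK), 8, BLOCK)
print("blocks shared by several records:", packed, "->", padded)

# Align Record: 6-word records, so gcd(6, 4) = 2 and offsets 0 and 2 are candidates.
g = gcd(6, BLOCK)
spans = {offset: average_span(6, 4, BLOCK, offset) for offset in range(BLOCK)}
print("gcd:", g, "average span per offset:", spans)
```

In this example padding removes every block shared between records at the cost of one unused word per record, and starting the array at an offset that is 0 or a multiple of the GCD gives the smaller average span.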

'''II. Dynamic Cache sub-block Design to Reduce False Sharing''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference] In this method, false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are determined dynamically with the objective of maximizing the size of a truly shared sub-block. The size settles to a value that is most suitable for the application being run and that also minimizes the invalidation misses due to false partitioning of truly shared blocks. Larger blocks exploit locality, while maintaining coherence on sub-blocks minimizes bus traffic due to shared misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization to support dynamic cache sub-block design.
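
The idea of keeping coherence on sub-blocks while transferring larger blocks can be sketched in a few lines of Python. This is a deliberately simplified illustration and does not reproduce the dynamic sizing of the cited scheme: the sub-block size is fixed, a single cached line is modelled, and a miss is assumed to refetch the whole block. The class <tt>Line</tt> and the function <tt>invalidation_misses</tt> are hypothetical names.

```python
class Line:
    """One cached block whose validity is tracked per sub-block."""

    def __init__(self, block_words, sub_words):
        self.sub_words = sub_words
        self.valid = [True] * (block_words // sub_words)

    def remote_write(self, word):
        # Another processor wrote `word`: invalidate only its sub-block.
        self.valid[word // self.sub_words] = False

    def read(self, word):
        """Return True on a hit. On a miss the whole block is refetched."""
        hit = self.valid[word // self.sub_words]
        if not hit:
            self.valid = [True] * len(self.valid)
        return hit


def invalidation_misses(block_words, sub_words, written, read, rounds=4):
    """Count misses when a remote write to `written` alternates with a local read of `read`."""
    line = Line(block_words, sub_words)
    misses = 0
    for _ in range(rounds):
        line.remote_write(written)
        if not line.read(read):
            misses += 1
    return misses


# Coherence on the whole 8-word block: every read of word 5 misses (false sharing).
print(invalidation_misses(8, 8, written=0, read=5))
# Coherence on 2-word sub-blocks: word 5 stays valid while word 0 is rewritten.
print(invalidation_misses(8, 2, written=0, read=5))
# Reading the written word itself still misses (true sharing).
print(invalidation_misses(8, 2, written=0, read=0))
```

With a finer coherence unit only the accesses that touch the written sub-block miss, which is the behaviour the sub-block design aims for.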

[[Image:False03.JPG]]

''' [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] New Architecture for the Dynamic Sub-block Coherence Scheme'''

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme ([http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4]).

[[Image:False04.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4] The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.'''

More about the benchmarks used:
*[http://www-flash.stanford.edu/apps/SPLASH/ MP3D]
*Floyd’s algorithm is designed to find the least-expensive paths between all the vertices in a graph. It does this by operating on a matrix representing the costs of edges between vertices. More [http://www-unix.mcs.anl.gov/dbpp/text/node35.html here].
*[http://www-flash.stanford.edu/apps/SPLASH/ Water]
*The Jacobi method is an algorithm in linear algebra for determining the solutions of a system of linear equations in which the largest absolute value in each row and column is the diagonal element. More [http://en.wikipedia.org/wiki/Jacobi_method here].

== <tt>'''Techniques to reduce true sharing misses'''</tt> ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in the program is a critical factor for performance on multiprocessors. High levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

In fact, true sharing misses are required for the correct execution of the program. The techniques used for true sharing are therefore mostly concerned with reducing the latency and bus traffic caused by a miss.

* One proposed technique is to give each processor a private L1 cache and to share an L2 cache among all the processors (also called an SPS2 system). [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5] shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5]: A possible SPS2 cache architecture to reduce true sharing latency.'''

The shared L2 is a multi-banked, multi-ported cache that can be accessed by all the processors directly over the bus. Data in private L1 and L2 are exclusive, but private L1 and shared L2 could be inclusive. Unlike a unified L2 cache structure, the split private and shared L2 caches of the SPS2 system can be designed flexibly and individually according to demand. SPS2 reduces access latency and contention between shared data and private data. It imposes a low L2 hit latency because most of the private data should be found in the local L2, while the shared data is placed in the shared L2, which collectively provides high storage capacity to help reduce off-chip accesses.

*Sub-blocking in the cache can also help reduce the bus communication caused by true sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6] shows sub-blocking in a cache.

[[Image:True2.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6]: Sub-blocking in a cache'''

A valid bit is associated with each sub-block in the cache line. A sub-block might be as small as the word that a processor writes or reads. If a true sharing miss then occurs, only the specific word written by the other processor needs to be sent across the bus, not the whole cache line.
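
The traffic saving can be illustrated with a short Python sketch. It is an assumption-laden model rather than a description of any particular hardware: cache lines are plain lists of words, the writer's copy is assumed to be directly readable, and the function name <tt>refill</tt> is hypothetical.

```python
def refill(local, valid, owner_copy, sub_words):
    """Copy only the invalid sub-blocks from owner_copy into local.

    Returns the number of words that had to cross the bus.
    """
    moved = 0
    for index, is_valid in enumerate(valid):
        if not is_valid:
            start = index * sub_words
            local[start:start + sub_words] = owner_copy[start:start + sub_words]
            valid[index] = True
            moved += sub_words
    return moved


LINE_WORDS = 8                  # assumed cache line size in words
reader = [0] * LINE_WORDS       # the reader's stale copy of the line
owner = [0] * LINE_WORDS        # the writer's up-to-date copy
valid = [True] * LINE_WORDS     # one valid bit per one-word sub-block

owner[3] = 99                   # the writer updates word 3 ...
valid[3] = False                # ... which invalidates only that sub-block

moved = refill(reader, valid, owner, sub_words=1)
print("words transferred:", moved, "instead of", LINE_WORDS)
```

Under these assumptions the true sharing miss on word 3 moves a single word rather than the whole eight-word line.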

==Conclusion==
The performance of parallel applications is influenced by their sharing behavior and varies significantly with cache block size. A cache block can be shared by several processors either on a per-processor basis or at a finer granularity. Data layout schemes and dynamic sub-blocking of the cache line have proved successful in reducing false sharing overhead over the years. True sharing is inherent to the parallel program and requires the processors involved to explicitly synchronize with each other to ensure program correctness. True sharing cannot be completely eliminated from parallel programs, but methods have been proposed to reduce the overhead of the inter-processor communication it causes.

== References ==
1. [http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J.P. Singh with Anoop Gupta]

2. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy, David A. Patterson]

3. [http://www.cs.princeton.edu/~jps/ Parallel Computer Architecture Lecture notes by Jaswinder Pal Singh]

4. [http://www.eecs.berkeley.edu/Pubs/TechRpts/1988/6056.html S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.]

5. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 Analysis of shared memory misses and reference patterns by Rothman et al.]

6. [http://citeseer.ist.psu.edu/160021.html Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations by Tor E. Jeremiassen and Susan J. Eggers]

7. [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.]

8. [http://citeseer.ist.psu.edu/kadiyala95dynamic.html Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing”, IEEE International Conference on Computer Design, 1995.]

9. [http://www.virtutech.com/about/research/pdf/zhao-ICIS-L2cache.pdf Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP]

10. http://suif.stanford.edu/papers/anderson95/node2.html

11. [http://www.intel.com/cd/ids/developer/asmo-na/eng/43813.htm False sharing in threaded programming environment and potential solutions]

12. [http://docs.hp.com/en/B3909-90003/ch13s02.html HP Article]

13. [http://www.usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/thomas/thomas_html/node10.html Effect of false sharing]

14. Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:49:07Z

Dtiwari2: /* Bibiliography */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:48:48Z

Dtiwari2: /* Bibiliography */

== <tt>'''Introduction'''</tt> ==

In a multiprocessor computing system with a snoopy coherence protocol, the overall performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by the communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses; [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). However, the misses arising from interprocessor communication are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block, after which the processor ''writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated because some word in it, other than the one being read, is written, so that a subsequent reference to the block misses.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, invalidating another processor's shared copy of that block, and the second processor later reads the same word, then the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different, so that the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. Evidently, as the cache block size increases, false sharing misses tend to increase as well. Although large blocks exploit locality and decrease the effective memory access time, the full cache block (all words in the block) is not necessarily useful to a particular processor; usually only part of it is relevant. In general, [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf previous works] have shown that the miss rate of the data cache in multiprocessors changes less predictably than in uniprocessors with increasing cache block size.

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2] ([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel applications using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation in <tt>O(n log n)</tt> time. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

Not only do false sharing misses increase memory access latency, they also generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic. Yet between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the popular and accepted techniques for combating false sharing:

'''I. Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested ([http://citeseer.ist.psu.edu/160021.html reference]) to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record''': Expand records in an array (padding them with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them shares the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain from prefetching, expansion is considered. If the reverse is true, the optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. It is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than one. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
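The effect of Expand Record and Align Record can be checked with simple index arithmetic. The sketch below is only an illustration under stated assumptions: records are measured in words, the array is contiguous, and the helper names (<tt>blocks_of_record</tt>, <tt>shared_blocks</tt>, <tt>avg_blocks_spanned</tt>, <tt>candidate_offsets</tt>) are hypothetical rather than taken from the cited papers.

```python
from math import gcd


def blocks_of_record(i, record_words, block_words, offset=0, pad_words=0):
    """Return the block indices touched by the data words of record i."""
    start = offset + i * (record_words + pad_words)
    last = start + record_words - 1
    return set(range(start // block_words, last // block_words + 1))


def shared_blocks(n, record_words, block_words, offset=0, pad_words=0):
    """Count blocks that hold words of more than one record."""
    owners = {}
    for i in range(n):
        for b in blocks_of_record(i, record_words, block_words, offset, pad_words):
            owners.setdefault(b, set()).add(i)
    return sum(1 for recs in owners.values() if len(recs) > 1)


def avg_blocks_spanned(n, record_words, block_words, offset=0):
    """Average number of blocks one record spans for a given array offset."""
    total = sum(len(blocks_of_record(i, record_words, block_words, offset))
                for i in range(n))
    return total / n


def candidate_offsets(record_words, block_words):
    """Offsets from a block boundary that are 0 or a multiple of the GCD."""
    return list(range(0, block_words, gcd(record_words, block_words)))


# Expand Record: 3-word records in 4-word blocks, with and without padding.
print(shared_blocks(8, 3, 4), shared_blocks(8, 3, 4, pad_words=1))

# Align Record: 6-word records in 8-word blocks (GCD = 2).
print(candidate_offsets(6, 8))
print(avg_blocks_spanned(4, 6, 8, offset=0), avg_blocks_spanned(4, 6, 8, offset=1))
```

In this example one padding word per record removes every shared block at the cost of a third more space, and an array offset that is a multiple of the GCD makes the average record span fewer blocks than an odd offset does.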

'''II. Dynamic Cache sub-block Design to Reduce False Sharing:''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference] In this method, false sharing is minimized by dynamically locating the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block size is determined dynamically, with the objective of maximizing the size of a truly shared sub-block. It settles to the value most suitable for the application being run, while also minimizing the invalidation misses caused by falsely partitioning truly shared blocks. Larger blocks thus exploit locality, while coherence maintained on sub-blocks minimizes the bus traffic due to sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization that supports the dynamic cache sub-block design.

[[Image:False03.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3]: New architecture for the dynamic sub-block coherence scheme'''
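The idea of transferring whole blocks while keeping coherence on sub-blocks can be sketched in a few lines. This is a toy model with a fixed sub-block size chosen by the caller; it does not attempt the dynamic sizing of the cited scheme, and the function name <tt>count_misses</tt> and the trace format are assumptions made for illustration.

```python
def count_misses(trace, block_words, sub_words):
    """Count misses when fills use whole blocks but invalidations use sub-blocks.

    trace is a list of (proc, op, word) tuples with op "r" or "w".
    Caches are unbounded; a write invalidates only the written sub-block
    in the other processors' caches.
    """
    assert block_words % sub_words == 0
    subs_per_block = block_words // sub_words
    procs = {p for p, _, _ in trace}
    valid = set()   # (proc, sub_block) pairs that are currently valid
    misses = 0
    for proc, op, word in trace:
        sub = word // sub_words
        if (proc, sub) not in valid:
            misses += 1
            # The fill brings in every sub-block of the enclosing block.
            first = (word // block_words) * subs_per_block
            for s in range(first, first + subs_per_block):
                valid.add((proc, s))
        if op == "w":
            for other in procs:
                if other != proc:
                    valid.discard((other, sub))
    return misses


# Two processors repeatedly write neighbouring words of one 4-word block.
trace = [(p, "w", p) for _ in range(4) for p in (0, 1)]
print(count_misses(trace, block_words=4, sub_words=4))   # coherence on whole blocks
print(count_misses(trace, block_words=4, sub_words=1))   # coherence on single words
```

When the coherence unit equals the block, every reference in this trace misses; with word-sized sub-blocks only the two initial fills remain, because each write invalidates a sub-block that the other processor never touches.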

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces false sharing misses by 20 to 90 percent compared with the fixed sub-block (FSB) scheme ([http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4]).

[[Image:False04.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4]: Percentage reduction in the false sharing miss rate of the DSB scheme compared with the FSB scheme for MP3D, Floyd, Water and Jacobi, as a function of block size.'''

More about the benchmarks used:
* [http://www-flash.stanford.edu/apps/SPLASH/ MP3D]
* Floyd’s algorithm finds the least-expensive paths between all pairs of vertices in a graph. It does this by operating on a matrix representing the costs of the edges between vertices. More [http://www-unix.mcs.anl.gov/dbpp/text/node35.html here].
* [http://www-flash.stanford.edu/apps/SPLASH/ Water]
* The Jacobi method is an algorithm in linear algebra for determining the solutions of a diagonally dominant system of linear equations, that is, a system in which the largest absolute value in each row and column is the diagonal element. More [http://en.wikipedia.org/wiki/Jacobi_method here].
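For readers unfamiliar with the Floyd benchmark, the matrix update it performs can be sketched as follows. This is a plain sequential version for illustration only, not the parallel code used in the cited experiments; the function name <tt>floyd</tt> and the sample cost matrix are made up.

```python
INF = float("inf")


def floyd(cost):
    """Return the matrix of least-expensive path costs between all vertex pairs.

    cost[i][j] is the cost of the edge from i to j (INF if there is none).
    """
    n = len(cost)
    dist = [row[:] for row in cost]   # work on a copy of the cost matrix
    for k in range(n):                # allow vertex k as an intermediate stop
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


cost = [[0, 4, 11],
        [6, 0, 2],
        [3, INF, 0]]
print(floyd(cost))   # [[0, 4, 6], [5, 0, 2], [3, 7, 0]]
```

Every iteration of the outer loop rewrites entries throughout the matrix, so a parallel version in which different processors update neighbouring entries of the same cache block is exposed to the false sharing discussed above.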

== <tt>'''Techniques to reduce true sharing misses'''</tt> ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors, since high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

True sharing misses are in fact required for the correct execution of the program, so they cannot simply be eliminated. The techniques aimed at true sharing therefore mostly try to reduce the latency and bus traffic caused by each miss.

* One proposed technique is to give each processor a private L1 cache and to share an L2 cache among all the processors (the SPS2 system, which splits the L2 into private and shared parts). [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5] shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5]: A possible SPS2 cache architecture to reduce true sharing latency.'''

The shared L2 is a multi-banked, multi-ported cache that can be accessed by all the processors directly over the bus. Data in the private L1 and the private L2 are exclusive, whereas the private L1 and the shared L2 can be inclusive. Unlike a unified L2 cache structure, the SPS2 system, with its split private and shared L2 caches, allows each part to be designed flexibly and individually according to demand. SPS2 reduces access latency and the contention between shared data and private data. The L2 hit latency stays low because most private data is found in the local private L2, while shared data is placed in the shared L2, whose banks collectively provide high storage capacity and help reduce off-chip accesses.

* Sub-blocking in the cache can also help reduce the bus traffic caused by true sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6] shows sub-blocking in a cache.

[[Image:True2.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6]: Sub-blocking in a cache'''

A valid bit is associated with each sub-block in the cache line. A sub-block may be as small as the word that a processor reads or writes. If a true sharing miss occurs, only the specific word written by the other processor needs to be sent across the bus, not the whole cache line.
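The saving in bus traffic can be made concrete with a small model of one cache line. The following sketch assumes word-sized sub-blocks and 4-byte words; the class <tt>SubBlockedLine</tt> and the <tt>fetch_word</tt> callback are hypothetical names used only for this illustration.

```python
class SubBlockedLine:
    """One cache line with a valid bit per word-sized sub-block."""

    def __init__(self, words, word_bytes=4):
        self.data = list(words)
        self.valid = [True] * len(self.data)
        self.word_bytes = word_bytes   # assumed word size in bytes

    def invalidate(self, index):
        """Clear the valid bit of the word written by another processor."""
        self.valid[index] = False

    def read(self, index, fetch_word):
        """Return (value, bytes moved over the bus) for a read of one word."""
        if self.valid[index]:
            return self.data[index], 0
        # True sharing miss: refill only the invalid sub-block.
        self.data[index] = fetch_word(index)
        self.valid[index] = True
        return self.data[index], self.word_bytes


line = SubBlockedLine([10, 20, 30, 40])
line.invalidate(2)                      # another processor wrote word 2
print(line.read(2, lambda i: 99))       # (99, 4): one word crosses the bus
print(line.read(1, lambda i: 0))        # (20, 0): still valid, no traffic
```

Without sub-blocking the same miss would move the whole line, 16 bytes in this example, and the read of word 1 would also have missed.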

== Bibliography ==
1. [http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J.P. Singh with Anoop Gupta]

2. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy and David A. Patterson]

3. [http://www.cs.princeton.edu/~jps/ Parallel Computer Architecture lecture notes by Jaswinder Pal Singh]

4. [http://www.eecs.berkeley.edu/Pubs/TechRpts/1988/6056.html S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.]

5. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 Analysis of shared memory misses and reference patterns by Rothman et al.]

6. [http://citeseer.ist.psu.edu/160021.html Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations by Tor E. Jeremiassen and Susan J. Eggers]

7. [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.]

8. [http://citeseer.ist.psu.edu/kadiyala95dynamic.html Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing”, IEEE International Conference on Computer Design, 1995.]

9. [http://www.virtutech.com/about/research/pdf/zhao-ICIS-L2cache.pdf Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP]

10. http://suif.stanford.edu/papers/anderson95/node2.html

11. [http://www.intel.com/cd/ids/developer/asmo-na/eng/43813.htm False sharing in threaded programming environment and potential solutions]

12. [http://docs.hp.com/en/B3909-90003/ch13s02.html HP Article]

13. [http://www.usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/thomas/thomas_html/node10.html Effect of false sharing]

14. Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:46:59Z

Dtiwari2: /* <tt>'''Effect of false sharing'''</tt> */


CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:46:23Z

Dtiwari2: /* Bibiliography */

== <tt>'''Introduction'''</tt> ==

In multiprocessor computing system with snoopy coherence protocol, the overall performance is a contributed by two factors
* Uniprocessor cache miss traffic
* The traffic caused by the communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]).However, the misses arising from interprocessor communication are named as coherence misses and can be divided into two different types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and'' writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because a particular word in the block, other than one being read, is getting written.

The effect of coherence misses become significant for tightly coupled applications that share significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
Id a one processor writes a word in particular block then other processor invalidates shared copy of the same block, subsequently later process intends to read the same word in the same block, then the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different and the invalidation does not cause a new value to be communicated, but only causes an extra cache miss, then it is a false sharing miss. Evidently, if the cache block size is increased, false sharing misses would also go up. Though, large block sizes exploit locality and decrease the effective memory access time, full cache block (all words in the block) is not necessarily useful to a particular processor rather mostly only part of it is relevant. In general, previous works have shown the miss rate of the data cache in multiprocessors changes less predictably than in uniprocessors with increasing cache block size.

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2]([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel application using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation having order <tt>O(n log n) </tt>. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

Not only do false misses increase the memory accesses latency, but also they generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic.Generally, between two consecutive misses on a given block, a processor usually references a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. Following are some of the popular and accepted techniques to combat false sharing issue:

'''I.Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested [http://citeseer.ist.psu.edu/160021.html reference]to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and in the cache block have a greater common divisor (GCD) larger than one.The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.

'''II. Dynamic Cache sub-block Design to Reduce False Sharing ''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference]In this method false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) which are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are dynamically determined with the objective of maximizing the size of a truly shared sub-block which will settle to a value which is most suitable for the application being run and also minimize the invalidation misses due to false partitioning of true shared blocks. Larger blocks exploit locality while coherence is maintained on sub-blocks which minimize bus traffic due to shared misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization to support dynamic cache sub-block design.

[[Image:False03.JPG]]

''' [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] New Architecture for the Dynamic Sub-block Coherence Scheme'''

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4].

[[Image:False04.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4] The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.'''

More about benchmarks used :
*[http://www-flash.stanford.edu/apps/SPLASH/ MP3D],
*Floyd’s algorithm is designed to find the least-expensive paths between all the vertices in a graph. It does this by operating on a matrix
representing the costs of edges between vertices. More [http://www-unix.mcs.anl.gov/dbpp/text/node35.html here]
*[http://www-flash.stanford.edu/apps/SPLASH/ Water]
*The Jacobi method is an algorithm in linear algebra for determining the solutions of a system of linear equations with largest absolute values in each row and column dominated by the diagonal element. More [http://en.wikipedia.org/wiki/Jacobi_method here].

== <tt>'''Techniques to reduce true sharing misses'''</tt> ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors: high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

True sharing misses are, in fact, required for the correct execution of the program. The techniques for dealing with true sharing therefore mostly aim to reduce the latency and bus traffic caused by each miss.

* One proposed technique is to give each processor a private L1 cache and to share an L2 cache among all the processors (also called the SPS2 system). [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5] shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5]: A possible SPS2 cache architecture to reduce true sharing latency.'''

The shared L2 is a multi-banked, multi-ported cache that all the processors can access directly over the bus. Data in the private L1 and private L2 are exclusive, but the private L1 and the shared L2 may be inclusive. Unlike a unified L2 cache structure, the SPS2 system, with its split private and shared L2 caches, allows each cache to be designed flexibly and individually according to demand. SPS2 reduces access latency and contention between shared data and private data. It offers a low L2 hit latency because most private data should be found in the local L2, while shared data are placed in the shared L2, which provides high collective storage capacity to help reduce off-chip accesses.

*Sub-blocking in the cache can also help reduce the bus communication caused by true sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6] shows sub-blocking in a cache.

[[Image:True2.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6]: Sub-blocking in a cache'''

A valid bit is associated with each sub-block in the cache line. A sub-block may be as small as the word that a processor reads or writes. If a true sharing miss occurs, only the specific word written by the other processor needs to be sent across the bus, not the whole cache line.
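
The per-sub-block valid bit described above can be sketched in C as follows. This is a simplified software model under stated assumptions, not a description of any real cache: the 8-word line, the <tt>cache_line</tt> layout and the helper names <tt>invalidate_word</tt>, <tt>is_hit</tt> and <tt>refill_word</tt> are hypothetical, and each sub-block is taken to be a single word.

```c
#include <assert.h>
#include <stdint.h>

#define WORDS_PER_LINE 8    /* assumed line size; one sub-block = one word */

struct cache_line {
    uint32_t tag;
    uint8_t  valid;                     /* bit i set: word i is valid */
    uint32_t words[WORDS_PER_LINE];
};

/* A remote processor wrote word w: clear only that word's valid bit. */
static void invalidate_word(struct cache_line *l, unsigned w) {
    assert(w < WORDS_PER_LINE);
    l->valid &= (uint8_t)~(1u << w);
}

/* A local access hits if the tag matches and the word is still valid. */
static int is_hit(const struct cache_line *l, uint32_t tag, unsigned w) {
    assert(w < WORDS_PER_LINE);
    return l->tag == tag && ((l->valid >> w) & 1u);
}

/* On a miss, fetch just the missing word; returns the words sent on the bus. */
static unsigned refill_word(struct cache_line *l, unsigned w, uint32_t value) {
    assert(w < WORDS_PER_LINE);
    l->words[w] = value;
    l->valid |= (uint8_t)(1u << w);
    return 1;
}
```

In this model a remote write to word 3 leaves word 2 valid, so a later read of word 2 still hits, and a read of word 3 transfers one word rather than the whole line.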

== Bibliography ==
1.[http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J.P. Singh with Anoop Gupta ]

2.[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy , David A. Patterson]

3.[http://www.cs.princeton.edu/~jps/ Parallel Computer Architecture Lecture notes By Jaswinder Pal Singh]

4. [http://www.eecs.berkeley.edu/Pubs/TechRpts/1988/6056.html S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.]

5. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 Analysis of shared memory misses and reference patterns by Rothman et al.]

6. [http://citeseer.ist.psu.edu/160021.html Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations by Tor E. Jeremiassen and Susan J. Eggers]

7. [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.]

8. Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

9. [http://citeseer.ist.psu.edu/kadiyala95dynamic.html Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.]

10.[http://www.virtutech.com/about/research/pdf/zhao-ICIS-L2cache.pdf Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP]

11. http://suif.stanford.edu/papers/anderson95/node2.html

12. [http://www.intel.com/cd/ids/developer/asmo-na/eng/43813.htm False sharing in threaded programming environment and potential solutions]

13.[http://docs.hp.com/en/B3909-90003/ch13s02.html HP Article]

14.[http://www.usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/thomas/thomas_html/node10.html Effect of false sharing]

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:45:06Z

Dtiwari2: /* Bibiliography */

== <tt>'''Introduction'''</tt> ==

In a multiprocessor computing system with a snoopy coherence protocol, overall performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses; see [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication, however, are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block, and the processor then ''writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because some word in the block, other than the one being read, has been written.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, the shared copy of that block in another processor's cache is invalidated. If the second processor subsequently reads the same word in that block, the reference is a true sharing reference and would cause a miss regardless of the block size. If, however, the word being written and the word being read are different, so that the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. Consequently, as the cache block size increases, false sharing misses tend to increase as well. Although large blocks exploit locality and decrease the effective memory access time, a processor does not necessarily use the full cache block (all words in the block); usually only part of it is relevant. In general, previous work has shown that the miss rate of the data cache changes less predictably with increasing cache block size in multiprocessors than in uniprocessors.
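
The distinction between the two kinds of miss can be sketched in C as follows. This is a minimal illustration rather than the classification used by any particular simulator: the structure <tt>block_state</tt>, the functions <tt>remote_write</tt> and <tt>classify_miss</tt>, and the 8-word block are all assumptions made for the example. Each block remembers which words a remote processor has written since the local copy was invalidated; a miss on one of those words is counted as true sharing, and a miss on any other word as false sharing.

```c
#include <assert.h>
#include <stdint.h>

#define WORDS_PER_BLOCK 8   /* assumed block size, in words */

enum miss_kind { TRUE_SHARING, FALSE_SHARING };

/* Hypothetical per-block bookkeeping for one processor's cache. */
struct block_state {
    int     valid;          /* 0 once a remote write has invalidated the copy */
    uint8_t remote_written; /* bit i set: word i was written remotely since then */
};

/* A remote processor writes word w: the local copy is invalidated. */
static void remote_write(struct block_state *b, unsigned w) {
    assert(w < WORDS_PER_BLOCK);
    b->valid = 0;
    b->remote_written |= (uint8_t)(1u << w);
}

/* The local processor misses on word w of an invalidated block. */
static enum miss_kind classify_miss(struct block_state *b, unsigned w) {
    assert(w < WORDS_PER_BLOCK && !b->valid);
    enum miss_kind kind =
        ((b->remote_written >> w) & 1u) ? TRUE_SHARING : FALSE_SHARING;
    b->valid = 1;           /* the refetch brings in the whole block */
    b->remote_written = 0;
    return kind;
}
```

With a larger block, more unrelated words share each <tt>block_state</tt>, so more of the misses in this model fall into the false sharing case.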

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2] ([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel applications using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation in <tt>O(n log n)</tt> time. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

Not only do false sharing misses increase memory access latency, they also generate traffic between processors and memory. As the block size increases, each miss produces a higher volume of traffic. Moreover, between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the popular and accepted techniques for combating false sharing:

'''I. Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested [http://citeseer.ist.psu.edu/160021.html reference] to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than one. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
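
The padding idea behind ''Split Scalar'' and ''Expand Record'' can be sketched in C11 as follows. This is a minimal sketch under stated assumptions: the 64-byte block size, the record types <tt>counter_packed</tt> and <tt>counter_padded</tt>, and the helpers <tt>block_of</tt> and <tt>same_block</tt> are hypothetical names chosen for the example, and real programs would take the block size from the target hardware.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 64   /* assumed cache block size in bytes */

/* Unpadded record: several records fit in one cache block, so counters
   updated by different processors can falsely share a block. */
struct counter_packed {
    long hits;
};

/* Expanded record: aligning the member to BLOCK_SIZE pads every record to
   a full block, so neighbouring array elements never share a block. */
struct counter_padded {
    alignas(BLOCK_SIZE) long hits;
};

/* Index of the cache block that an address falls into. */
static size_t block_of(const void *p) {
    return (size_t)((uintptr_t)p / BLOCK_SIZE);
}

/* Returns 1 if the two objects start in the same cache block. */
static int same_block(const void *a, const void *b) {
    return block_of(a) == block_of(b);
}
```

The trade-off is the one noted in the list above: the padded array uses a full block per record, which removes false sharing between records but also gives up any prefetching across them.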

'''II. Dynamic Cache sub-block Design to Reduce False Sharing''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference] In this method, false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are determined dynamically, with the objective of maximizing the size of each truly shared sub-block. The size settles to a value suited to the application being run, which also minimizes the invalidation misses caused by falsely partitioning truly shared blocks. Larger blocks exploit locality, while coherence is maintained on sub-blocks, which minimizes the bus traffic due to sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization that supports the dynamic cache sub-block design.

[[Image:False03.JPG]]

''' [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] New Architecture for the Dynamic Sub-block Coherence Scheme'''

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces false sharing misses by 20 to 90 percent relative to the fixed sub-block (FSB) scheme ([http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4]).

[[Image:False04.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4] The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.'''

More about the benchmarks used:
*[http://www-flash.stanford.edu/apps/SPLASH/ MP3D]
*Floyd’s algorithm finds the least-expensive paths between all pairs of vertices in a graph. It does this by operating on a matrix representing the costs of edges between vertices. More [http://www-unix.mcs.anl.gov/dbpp/text/node35.html here].
*[http://www-flash.stanford.edu/apps/SPLASH/ Water]
*The Jacobi method is an iterative algorithm in linear algebra for solving a system of linear equations whose matrix is diagonally dominant, that is, the element of largest absolute value in each row and column is the diagonal element. More [http://en.wikipedia.org/wiki/Jacobi_method here].

== <tt>'''Techniques to reduce true sharing misses'''</tt> ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors: high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

True sharing misses are, in fact, required for the correct execution of the program. The techniques for dealing with true sharing therefore mostly aim to reduce the latency and bus traffic caused by each miss.

* One proposed technique is to give each processor a private L1 cache and to share an L2 cache among all the processors (also called the SPS2 system). [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5] shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5]: A possible SPS2 cache architecture to reduce true sharing latency.'''

The shared L2 is a multi-banked, multi-ported cache that all the processors can access directly over the bus. Data in the private L1 and private L2 are exclusive, but the private L1 and the shared L2 may be inclusive. Unlike a unified L2 cache structure, the SPS2 system, with its split private and shared L2 caches, allows each cache to be designed flexibly and individually according to demand. SPS2 reduces access latency and contention between shared data and private data. It offers a low L2 hit latency because most private data should be found in the local L2, while shared data are placed in the shared L2, which provides high collective storage capacity to help reduce off-chip accesses.

*Sub-blocking in the cache can also help reduce the bus communication caused by true sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6] shows sub-blocking in a cache.

[[Image:True2.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6]: Sub-blocking in a cache'''

A valid bit is associated with each sub-block in the cache line. A sub-block may be as small as the word that a processor reads or writes. If a true sharing miss occurs, only the specific word written by the other processor needs to be sent across the bus, not the whole cache line.

== Bibliography ==
1.[http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J.P. Singh with Anoop Gupta ]

2.[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy , David A. Patterson]

3.[http://www.cs.princeton.edu/~jps/ Parallel Computer Architecture Lecture notes By Jaswinder Pal Singh]

4. [http://www.eecs.berkeley.edu/Pubs/TechRpts/1988/6056.html S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.]

5. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 Analysis of shared memory misses and reference patterns by Rothman et al.]

6. [http://citeseer.ist.psu.edu/160021.html Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations by Tor E. Jeremiassen and Susan J. Eggers]

7. [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.]

8. Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

9. [http://citeseer.ist.psu.edu/kadiyala95dynamic.html Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.]

10.[http://www.virtutech.com/about/research/pdf/zhao-ICIS-L2cache.pdf Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP]

11. http://suif.stanford.edu/papers/anderson95/node2.html

12. [http://www.intel.com/cd/ids/developer/asmo-na/eng/43813.htm False sharing in threaded programming environment and potential solutions]

13.[http://docs.hp.com/en/B3909-90003/ch13s02.html HP Article]

14.[http://www.usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/thomas/thomas_html/node10.html Effect of false sharing]

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:44:35Z

Dtiwari2: /* Bibiliography */

== <tt>'''Introduction'''</tt> ==

In a multiprocessor computing system with a snoopy coherence protocol, overall performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses; see [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication, however, are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block, and the processor then ''writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because some word in the block, other than the one being read, has been written.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, the shared copy of that block in another processor's cache is invalidated. If the second processor subsequently reads the same word in that block, the reference is a true sharing reference and would cause a miss regardless of the block size. If, however, the word being written and the word being read are different, so that the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. Consequently, as the cache block size increases, false sharing misses tend to increase as well. Although large blocks exploit locality and decrease the effective memory access time, a processor does not necessarily use the full cache block (all words in the block); usually only part of it is relevant. In general, previous work has shown that the miss rate of the data cache changes less predictably with increasing cache block size in multiprocessors than in uniprocessors.

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2] ([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel applications using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation in <tt>O(n log n)</tt> time. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

Not only do false sharing misses increase memory access latency, they also generate traffic between processors and memory. As the block size increases, each miss produces a higher volume of traffic. Moreover, between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the popular and accepted techniques for combating false sharing:

'''I. Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested [http://citeseer.ist.psu.edu/160021.html reference] to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than one. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
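
The effect of the ''Align Record'' offset can be illustrated with a small C calculation. This is a sketch under stated assumptions, not the procedure from the cited work: the function names <tt>gcd</tt>, <tt>blocks_spanned</tt> and <tt>total_spans</tt> are hypothetical, sizes are measured in words, and records are assumed to be laid out contiguously from the chosen offset.

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm. */
static unsigned gcd(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of cache blocks touched by one record that starts at word
   offset `start` (all sizes in words). */
static unsigned blocks_spanned(unsigned start, unsigned rec_words,
                               unsigned block_words) {
    assert(rec_words > 0 && block_words > 0);
    unsigned first = start / block_words;
    unsigned last  = (start + rec_words - 1) / block_words;
    return last - first + 1;
}

/* Sum of blocks spanned by n contiguous records, the first of which
   starts `offset` words past a block boundary. */
static unsigned total_spans(unsigned offset, unsigned rec_words,
                            unsigned block_words, unsigned n) {
    unsigned total = 0;
    for (unsigned i = 0; i < n; i++)
        total += blocks_spanned(offset + i * rec_words, rec_words, block_words);
    return total;
}
```

For example, with 6-word records and 8-word blocks the GCD is 2. Four records laid out at offset 0 span 6 blocks in total, whereas the same records at offset 1, which is not a multiple of the GCD, span 7.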

'''II. Dynamic Cache sub-block Design to Reduce False Sharing''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference] In this method, false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are determined dynamically, with the objective of maximizing the size of each truly shared sub-block. The size settles to a value suited to the application being run, which also minimizes the invalidation misses caused by falsely partitioning truly shared blocks. Larger blocks exploit locality, while coherence is maintained on sub-blocks, which minimizes the bus traffic due to sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization that supports the dynamic cache sub-block design.

[[Image:False03.JPG]]

''' [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] New Architecture for the Dynamic Sub-block Coherence Scheme'''

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces false sharing misses by 20 to 90 percent relative to the fixed sub-block (FSB) scheme ([http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4]).

[[Image:False04.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4] The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.'''

More about the benchmarks used:
*[http://www-flash.stanford.edu/apps/SPLASH/ MP3D]
*Floyd’s algorithm finds the least-expensive paths between all pairs of vertices in a graph. It does this by operating on a matrix representing the costs of edges between vertices. More [http://www-unix.mcs.anl.gov/dbpp/text/node35.html here].
*[http://www-flash.stanford.edu/apps/SPLASH/ Water]
*The Jacobi method is an iterative algorithm in linear algebra for solving a system of linear equations whose matrix is diagonally dominant, that is, the element of largest absolute value in each row and column is the diagonal element. More [http://en.wikipedia.org/wiki/Jacobi_method here].

== <tt>'''Techniques to reduce true sharing misses'''</tt> ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors: high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

True sharing misses are, in fact, required for the correct execution of the program. The techniques for dealing with true sharing therefore mostly aim to reduce the latency and bus traffic caused by each miss.

* One proposed technique is to give each processor a private L1 cache and to share an L2 cache among all the processors (also called the SPS2 system). [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5] shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5]: A possible SPS2 cache architecture to reduce true sharing latency.'''

The shared L2 is a multi-banked, multi-ported cache that all the processors can access directly over the bus. Data in the private L1 and private L2 are exclusive, but the private L1 and the shared L2 may be inclusive. Unlike a unified L2 cache structure, the SPS2 system, with its split private and shared L2 caches, allows each cache to be designed flexibly and individually according to demand. SPS2 reduces access latency and contention between shared data and private data. It offers a low L2 hit latency because most private data should be found in the local L2, while shared data are placed in the shared L2, which provides high collective storage capacity to help reduce off-chip accesses.

*Sub-blocking in the cache can also help reduce the bus communication caused by true sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6] shows sub-blocking in a cache.

[[Image:True2.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6]: Sub-blocking in a cache'''

A valid bit is associated with each sub-block in the cache line. A sub-block may be as small as the word that a processor reads or writes. If a true sharing miss occurs, only the specific word written by the other processor needs to be sent across the bus, not the whole cache line.

== Bibliography ==
1.[http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J.P. Singh with Anoop Gupta ]

2.[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy , David A. Patterson]

3.[http://www.cs.princeton.edu/~jps/ Parallel Computer Architecture Lecture notes By Jaswinder Pal Singh]

4. [http://www.eecs.berkeley.edu/Pubs/TechRpts/1988/6056.html S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.]

5. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 Analysis of shared memory misses and reference patterns by Rothman et al.]

6. [http://citeseer.ist.psu.edu/160021.html Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations by Tor E. Jeremiassen and Susan J. Eggers]

7. [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.]

8. Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

9. [http://citeseer.ist.psu.edu/kadiyala95dynamic.html Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.]

10. [http://www.virtutech.com/about/research/pdf/zhao-ICIS-L2cache.pdf Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP]

11. http://suif.stanford.edu/papers/anderson95/node2.html

12. [http://www.intel.com/cd/ids/developer/asmo-na/eng/43813.htm False sharing in threaded programming environment and potential solutions]

13.[http://docs.hp.com/en/B3909-90003/ch13s02.html HP Article]

14.[http://www.usenix.org/publications/library/proceedings/als00/2000papers/papers/full_papers/thomas/thomas_html/node10.html Effect of false sharing]

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:37:55Z

Dtiwari2: /* Bibiliography */

== <tt>'''Introduction'''</tt> ==

In a multiprocessor computing system with a snoopy coherence protocol, the overall cache performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses; see [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication, however, are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and ''writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because some word in the block, other than the one being read, has been written.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, invalidating another processor's shared copy of that block, and the second processor later reads the same word, then the reference is a true sharing reference and would cause a miss regardless of the block size. If, however, the word being written and the word being read are different, so that the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. As the cache block size increases, more unrelated words share a block, so false sharing misses tend to increase as well. Although large blocks exploit spatial locality and decrease the effective memory access time, the full cache block (all the words in the block) is not necessarily useful to a particular processor; usually only part of it is relevant. In general, previous work has shown that the miss rate of the data cache changes less predictably with increasing block size in multiprocessors than in uniprocessors.
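
The distinction above can be illustrated with a small trace-driven sketch. It is not taken from any of the cited papers: the function name <tt>classify_misses</tt>, the trace format, and the classification rule (a coherence miss is "true" only if the missing word itself was written remotely since the copy was lost) are simplifying assumptions, and caches are assumed to be infinite so that only cold and coherence misses appear.

```python
def classify_misses(trace, words_per_block):
    """Classify misses in a trace of (processor, op, word_address) tuples.

    op is "R" or "W". Caches are assumed infinite (an assumption of this
    sketch), so only cold and coherence misses occur. A write to a block
    that is already valid locally is treated as a hit.
    """
    processors = {proc for proc, _, _ in trace}
    valid = set()   # (processor, block) pairs that currently hold a valid copy
    seen = set()    # (processor, block) pairs that were cached at least once
    stale = {}      # (processor, block) -> words written remotely since the copy was lost
    counts = {"cold": 0, "true": 0, "false": 0}

    for proc, op, addr in trace:
        block = addr // words_per_block
        key = (proc, block)
        if key not in valid:
            if key not in seen:
                counts["cold"] += 1
            elif addr in stale[key]:
                counts["true"] += 1    # the missing word itself was modified remotely
            else:
                counts["false"] += 1   # only other words of the block were modified
            valid.add(key)
            seen.add(key)
            stale[key] = set()
        if op == "W":
            # Invalidation-based protocol: every other copy of the block is invalidated.
            for other in processors - {proc}:
                other_key = (other, block)
                valid.discard(other_key)
                if other_key in seen:
                    stale[other_key].add(addr)
    return counts


# Two processors repeatedly write two different, adjacent words.
false_trace = [(0, "W", 0), (1, "W", 1)] * 4
# One processor produces a word that the other one consumes.
true_trace = [(0, "W", 0), (1, "R", 0)] * 3

print(classify_misses(false_trace, words_per_block=1))
print(classify_misses(false_trace, words_per_block=2))
print(classify_misses(true_trace, words_per_block=1))
print(classify_misses(true_trace, words_per_block=4))
```

With one word per block, the first trace shows only the two cold misses; with two words per block, six false sharing misses appear. The producer-consumer trace yields the same two true sharing misses for both block sizes, matching the observation that true sharing misses do not depend on the block size.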

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2] ([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel applications using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation having order <tt>O(n log n) </tt>. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

Not only do false sharing misses increase memory access latency, they also generate traffic between processors and memory. As the block size increases, each miss produces a higher volume of traffic. Yet between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the widely accepted techniques:

'''I. Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method reduces false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested ([http://citeseer.ist.psu.edu/160021.html reference]) to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than one. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
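
The effect of ''Expand Record'' and ''Align Record'' can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: the helper names (<tt>record_blocks</tt>, <tt>shared_blocks</tt>, <tt>avg_blocks_spanned</tt>) are hypothetical, addresses are counted in words, and the array is assumed to be contiguous.

```python
def record_blocks(index, record_words, block_words, pad_words=0, offset=0):
    """Return the set of cache-block numbers touched by record `index`.

    `offset` is the distance (in words) of the array start from a block
    boundary; `pad_words` dummy words follow each record (Expand Record).
    """
    start = offset + index * (record_words + pad_words)
    return {word // block_words for word in range(start, start + record_words)}


def shared_blocks(num_records, record_words, block_words, pad_words=0, offset=0):
    """Count cache blocks that hold words from more than one record."""
    owners = {}
    for i in range(num_records):
        for block in record_blocks(i, record_words, block_words, pad_words, offset):
            owners.setdefault(block, set()).add(i)
    return sum(1 for records in owners.values() if len(records) > 1)


def avg_blocks_spanned(num_records, record_words, block_words, offset=0):
    """Average number of cache blocks a record spans (Align Record metric)."""
    total = sum(
        len(record_blocks(i, record_words, block_words, 0, offset))
        for i in range(num_records)
    )
    return total / num_records


# Expand Record: 3-word records in 4-word blocks, without and with one pad word.
print(shared_blocks(4, 3, 4))               # blocks shared by neighbouring records
print(shared_blocks(4, 3, 4, pad_words=1))  # each record now has its own block

# Align Record: 6-word records, 4-word blocks, GCD = 2.
for offset in (0, 1, 2):
    print(offset, avg_blocks_spanned(4, 6, 4, offset=offset))
```

In this example, padding each 3-word record to 4 words removes every shared block, and starting the 6-word records at an offset of 0 or 2 (multiples of the GCD) keeps the average span at 2 blocks, whereas an offset of 1 raises it to 2.5.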

'''II. Dynamic Cache sub-block Design to Reduce False Sharing''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference] In this method, false sharing is minimized by dynamically locating the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) which are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are determined dynamically with the objective of maximizing the size of a truly shared sub-block. The size settles at a value suited to the application being run, which also minimizes the invalidation misses caused by falsely partitioning truly shared blocks. Larger blocks exploit locality, while coherence is maintained on sub-blocks, which minimizes bus traffic due to sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization to support dynamic cache sub-block design.

[[Image:False03.JPG]]

''' [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] New Architecture for the Dynamic Sub-block Coherence Scheme'''

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4].

[[Image:False04.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4] The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.'''

More about the benchmarks used:
*[http://www-flash.stanford.edu/apps/SPLASH/ MP3D]
*Floyd’s algorithm finds the least-expensive paths between all pairs of vertices in a graph. It operates on a matrix representing the costs of edges between vertices. More [http://www-unix.mcs.anl.gov/dbpp/text/node35.html here].
*[http://www-flash.stanford.edu/apps/SPLASH/ Water]
*The Jacobi method is an iterative algorithm in linear algebra for solving a diagonally dominant system of linear equations, that is, one in which the diagonal element dominates each row in absolute value. More [http://en.wikipedia.org/wiki/Jacobi_method here].

== <tt>'''Techniques to reduce true sharing misses'''</tt> ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in the program is a critical factor for performance on multiprocessors. High levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

In fact, true sharing misses are required for the correct execution of the program. Techniques for true sharing therefore mostly aim to reduce the latency and bus traffic caused by each miss.

* One proposed technique is to combine private L1 caches with an L2 cache shared among all the processors (the SPS2 system). [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5] shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5]: A possible SPS2 cache architecture to reduce true sharing latency.'''

Shared L2 is a multi-banked, multi-port cache that can be accessed by all the processors directly over the bus. Data in the private L1 and private L2 are exclusive, but the private L1 and shared L2 could be inclusive. Unlike a unified L2 cache structure, the SPS2 system, with its split private and shared L2 caches, can be designed flexibly and individually according to demand. SPS2 reduces access latency and contention between shared data and private data. It offers a low L2 hit latency because most of the private data should be found in the local L2, while the shared data is placed in the shared L2 banks, which collectively provide high storage capacity to help reduce off-chip accesses.

*Sub-blocking in the cache can also help in reducing the bus communication because of true sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6] shows the sub-blocking in a cache.

[[Image:True2.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6]: Sub-blocking in a cache'''

A valid bit is associated with each sub-block in the cache line. A sub-block may be as small as a single word that a processor reads or writes. When a true sharing miss occurs, only the specific word written by the processor needs to be sent across the bus, not the whole cache line.
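
As a rough illustration of the traffic saving, the following sketch counts how many sub-blocks cross the bus for one cache line under the two schemes. It is a simplified model, not the design from the cited work: the function name <tt>bus_sub_blocks</tt>, the event names, and the assumption that a remote write clears exactly one valid bit are all hypothetical.

```python
def bus_sub_blocks(events, num_sub_blocks, per_sub_block_valid):
    """Count sub-blocks fetched over the bus for a single cache line.

    events is a list of ("remote_write", i) or ("local_read", i), where i is
    a sub-block index. With per_sub_block_valid=False the line has a single
    valid bit, so any remote write invalidates the whole line and a miss
    refetches all of it.
    """
    valid = [True] * num_sub_blocks
    transferred = 0
    for kind, i in events:
        if kind == "remote_write":
            if per_sub_block_valid:
                valid[i] = False                    # only the written sub-block is lost
            else:
                valid = [False] * num_sub_blocks    # the whole line is lost
        elif kind == "local_read" and not valid[i]:
            if per_sub_block_valid:
                valid[i] = True
                transferred += 1                    # fetch just the missing sub-block
            else:
                valid = [True] * num_sub_blocks
                transferred += num_sub_blocks       # fetch the entire line
    return transferred


events = [
    ("remote_write", 2), ("local_read", 2),   # true sharing on sub-block 2
    ("local_read", 0),
    ("remote_write", 1), ("local_read", 1),   # true sharing on sub-block 1
]
print(bus_sub_blocks(events, 4, per_sub_block_valid=True))
print(bus_sub_blocks(events, 4, per_sub_block_valid=False))
```

For this event sequence, the sub-blocked line transfers 2 sub-blocks while the conventional line transfers 8, since each true sharing miss refetches all four sub-blocks.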

== Bibliography ==
1.[http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J.P. Singh with Anoop Gupta ]

2.[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy , David A. Patterson]

3.[http://www.cs.princeton.edu/~jps/ Parallel Computer Architecture Lecture notes By Jaswinder Pal Singh]

4. S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.

5. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 Analysis of shared memory misses and reference patterns by Rothman et. al.]

6. [http://citeseer.ist.psu.edu/160021.html Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations by Tor E. Jeremiassen and Susan J. Eggers]

7. J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.

8. Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

9. Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

10. http://suif.stanford.edu/papers/anderson95/node2.html

11. Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:35:01Z

Dtiwari2: /* References */

== <tt>'''Introduction'''</tt> ==

In a multiprocessor computing system with a snoopy coherence protocol, the overall cache performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses; see [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication, however, are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and ''writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because some word in the block, other than the one being read, has been written.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, invalidating another processor's shared copy of that block, and the second processor later reads the same word, then the reference is a true sharing reference and would cause a miss regardless of the block size. If, however, the word being written and the word being read are different, so that the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. As the cache block size increases, more unrelated words share a block, so false sharing misses tend to increase as well. Although large blocks exploit spatial locality and decrease the effective memory access time, the full cache block (all the words in the block) is not necessarily useful to a particular processor; usually only part of it is relevant. In general, previous work has shown that the miss rate of the data cache changes less predictably with increasing block size in multiprocessors than in uniprocessors.

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2] ([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel applications using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation having order <tt>O(n log n) </tt>. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

Not only do false sharing misses increase memory access latency, they also generate traffic between processors and memory. As the block size increases, each miss produces a higher volume of traffic. Yet between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the widely accepted techniques:

'''I. Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method reduces false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested ([http://citeseer.ist.psu.edu/160021.html reference]) to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than one. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
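
The ''Heap Allocate'' idea can be sketched with a toy bump allocator. This is an illustrative model only, not the allocator from any cited paper: the names <tt>bump_allocate</tt> and <tt>blocks_with_mixed_owners</tt> are hypothetical, sizes are in bytes, each per-processor region is assumed to start on a block boundary, and region overflow is not handled.

```python
def bump_allocate(requests, per_processor_regions, region_bytes=1024):
    """Place allocation requests and return (processor, start, size) tuples.

    requests is a list of (processor, size). With per_processor_regions=True
    each processor allocates from its own region starting at
    processor * region_bytes (assumed to be a multiple of the block size);
    otherwise all requests share a single bump pointer.
    """
    next_free = {}      # per-processor offset inside its own region
    shared_next = 0     # bump pointer of the single shared heap
    placements = []
    for proc, size in requests:
        if per_processor_regions:
            offset = next_free.get(proc, 0)
            placements.append((proc, proc * region_bytes + offset, size))
            next_free[proc] = offset + size
        else:
            placements.append((proc, shared_next, size))
            shared_next += size
    return placements


def blocks_with_mixed_owners(placements, block_bytes):
    """Count cache blocks containing data allocated by more than one processor."""
    owners = {}
    for proc, start, size in placements:
        first = start // block_bytes
        last = (start + size - 1) // block_bytes
        for block in range(first, last + 1):
            owners.setdefault(block, set()).add(proc)
    return sum(1 for procs in owners.values() if len(procs) > 1)


requests = [(0, 16), (1, 16), (0, 16), (1, 16)]   # interleaved 16-byte requests
print(blocks_with_mixed_owners(bump_allocate(requests, False), block_bytes=64))
print(blocks_with_mixed_owners(bump_allocate(requests, True), block_bytes=64))
```

With a single shared heap, all four 16-byte objects land in the same 64-byte block, which is then a candidate for false sharing; with per-processor regions, no block mixes data from the two processors.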

'''II. Dynamic Cache sub-block Design to Reduce False Sharing''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference] In this method, false sharing is minimized by dynamically locating the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) which are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are determined dynamically with the objective of maximizing the size of a truly shared sub-block. The size settles at a value suited to the application being run, which also minimizes the invalidation misses caused by falsely partitioning truly shared blocks. Larger blocks exploit locality, while coherence is maintained on sub-blocks, which minimizes bus traffic due to sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization to support dynamic cache sub-block design.

[[Image:False03.JPG]]

''' [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] New Architecture for the Dynamic Sub-block Coherence Scheme'''

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4].

[[Image:False04.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4] The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.'''

More about the benchmarks used:
*[http://www-flash.stanford.edu/apps/SPLASH/ MP3D]
*Floyd’s algorithm finds the least-expensive paths between all pairs of vertices in a graph. It operates on a matrix representing the costs of edges between vertices. More [http://www-unix.mcs.anl.gov/dbpp/text/node35.html here].
*[http://www-flash.stanford.edu/apps/SPLASH/ Water]
*The Jacobi method is an iterative algorithm in linear algebra for solving a diagonally dominant system of linear equations, that is, one in which the diagonal element dominates each row in absolute value. More [http://en.wikipedia.org/wiki/Jacobi_method here].

== <tt>'''Techniques to reduce true sharing misses'''</tt> ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in the program is a critical factor for performance on multiprocessors. High levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

In fact, true sharing misses are required for the correct execution of the program. Techniques for true sharing therefore mostly aim to reduce the latency and bus traffic caused by each miss.

* One proposed technique is to combine private L1 caches with an L2 cache shared among all the processors (the SPS2 system). [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5] shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5]: A possible SPS2 cache architecture to reduce true sharing latency.'''

Shared L2 is a multi-banked, multi-port cache that can be accessed by all the processors directly over the bus. Data in the private L1 and private L2 are exclusive, but the private L1 and shared L2 could be inclusive. Unlike a unified L2 cache structure, the SPS2 system, with its split private and shared L2 caches, can be designed flexibly and individually according to demand. SPS2 reduces access latency and contention between shared data and private data. It offers a low L2 hit latency because most of the private data should be found in the local L2, while the shared data is placed in the shared L2 banks, which collectively provide high storage capacity to help reduce off-chip accesses.

*Sub-blocking in the cache can also help in reducing the bus communication because of true sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6] shows the sub-blocking in a cache.

[[Image:True2.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6]: Sub-blocking in a cache'''

A valid bit is associated with each sub-block in the cache line. A sub-block may be as small as a single word that a processor reads or writes. When a true sharing miss occurs, only the specific word written by the processor needs to be sent across the bus, not the whole cache line.

== Bibliography ==
1.[http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J.P. Singh with Anoop Gupta ]

2.[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy , David A. Patterson]

3.[http://www.cs.princeton.edu/~jps/ Parallel Computer Architecture Lecture notes By Jaswinder Pal Singh]

4. S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.

5. J. Lee and Y Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.

6. Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

7. Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

8. http://suif.stanford.edu/papers/anderson95/node2.html

9. Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:33:33Z

Dtiwari2: /* <tt>'''Techniques to reduce false sharing misses'''</tt> */

== <tt>'''Introduction'''</tt> ==

In a multiprocessor computing system with a snoopy coherence protocol, the overall cache performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses; see [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication, however, are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and ''writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because some word in the block, other than the one being read, has been written.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, invalidating another processor's shared copy of that block, and the second processor later reads the same word, then the reference is a true sharing reference and would cause a miss regardless of the block size. If, however, the word being written and the word being read are different, so that the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. As the cache block size increases, more unrelated words share a block, so false sharing misses tend to increase as well. Although large blocks exploit spatial locality and decrease the effective memory access time, the full cache block (all the words in the block) is not necessarily useful to a particular processor; usually only part of it is relevant. In general, previous work has shown that the miss rate of the data cache changes less predictably with increasing block size in multiprocessors than in uniprocessors.

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2] ([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel applications using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation having order <tt>O(n log n) </tt>. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

Not only do false sharing misses increase memory access latency, they also generate traffic between processors and memory. As the block size increases, each miss produces a higher volume of traffic. Yet between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the widely accepted techniques:

'''I. Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method reduces false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested ([http://citeseer.ist.psu.edu/160021.html reference]) to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, the optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than one. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
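The Expand Record and Align Record optimizations above come down to arithmetic on record size, block size and starting offset. The following is a minimal sketch of that arithmetic, not code from the cited paper; the helper name <tt>layout_stats</tt> and the example sizes (6-word records, 8-word blocks) are assumptions chosen for illustration.

```python
from math import gcd


def layout_stats(record_words, block_words, n_records, offset=0):
    """Return (average blocks spanned per record, blocks holding more than one record).

    Records are laid out back to back, starting `offset` words after a block boundary.
    """
    owners = {}       # block index -> set of records touching that block
    total_spans = 0
    for r in range(n_records):
        first = offset + r * record_words
        last = first + record_words - 1
        blocks = range(first // block_words, last // block_words + 1)
        total_spans += len(blocks)
        for b in blocks:
            owners.setdefault(b, set()).add(r)
    shared = sum(1 for recs in owners.values() if len(recs) > 1)
    return total_spans / n_records, shared


# Expand Record: padding 6-word records to 8 words removes all block sharing.
print(layout_stats(6, 8, 4))   # unpadded records
print(layout_stats(8, 8, 4))   # padded with 2 dummy words per record

# Align Record: candidate offsets are 0 and the multiples of the GCD.
step = gcd(6, 8)
for off in range(0, 8, step):
    print(off, layout_stats(6, 8, 4, offset=off))
print(1, layout_stats(6, 8, 4, offset=1))   # offset that is not a multiple of the GCD
```

In this example the unpadded layout leaves three blocks shared between neighbouring records, and padding each record to the block size leaves none. Starting the array at offset 0 makes the average record span 1.5 blocks, while the misaligned offset 1 raises that to 1.75.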

'''II. Dynamic Cache sub-block Design to Reduce False Sharing''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference] In this method, false sharing is minimized by dynamically locating the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are determined dynamically, with the objective of maximizing the size of a truly shared sub-block. The size settles to the value most suitable for the application being run, which also minimizes the invalidation misses caused by false partitioning of truly shared blocks. Larger blocks exploit locality, while maintaining coherence on sub-blocks minimizes the bus traffic due to shared misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization that supports the dynamic cache sub-block design.

[[Image:False03.JPG]]

''' [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] New Architecture for the Dynamic Sub-block Coherence Scheme'''

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme ([http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4]).

[[Image:False04.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4] The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.'''

More about the benchmarks used:
*[http://www-flash.stanford.edu/apps/SPLASH/ MP3D]
*Floyd’s algorithm finds the least-expensive paths between all pairs of vertices in a graph by operating on a matrix representing the costs of edges between vertices. More [http://www-unix.mcs.anl.gov/dbpp/text/node35.html here].
*[http://www-flash.stanford.edu/apps/SPLASH/ Water]
*The Jacobi method is an algorithm in linear algebra for solving a system of linear equations in which the diagonal element dominates each row and column in absolute value. More [http://en.wikipedia.org/wiki/Jacobi_method here].

== <tt>'''Techniques to reduce true sharing misses'''</tt> ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in the program is a critical factor for performance on multiprocessors. High levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

In fact, true sharing misses are required for the correct execution of the program. The techniques for dealing with true sharing therefore mostly aim to reduce the latency and bus traffic caused by each miss.

* One proposed technique is to give each processor a private L1 cache and to share an L2 cache among all the processors (also called an SPS2 system). [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5] shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5]: A possible SPS2 cache architecture to reduce true sharing latency.'''

The shared L2 is a multi-banked, multi-ported cache that can be accessed by all the processors directly over the bus. Data in the private L1 and private L2 are exclusive, but the private L1 and shared L2 can be inclusive. Unlike a unified L2 cache structure, the SPS2 system's split private and shared L2 caches can be designed flexibly and individually according to demand. SPS2 reduces access latency and contention between shared data and private data. It offers a low L2 hit latency because most private data should be found in the local L2, while shared data is placed in the shared L2, whose banks collectively provide high storage capacity and help reduce off-chip accesses.

*Sub-blocking in the cache can also help reduce the bus communication caused by true sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6] shows sub-blocking in a cache.
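The way a request might be serviced in such a hierarchy can be sketched as follows. This is only an illustration under stated assumptions: the function name <tt>sps2_lookup</tt>, the use of Python sets as caches and the strictly sequential lookup order (L1, then private L2, then shared L2) are hypothetical and are not taken from the SPS2 paper.

```python
def sps2_lookup(addr, l1, private_l2, shared_l2):
    """Return the name of the first level that holds addr (assumed lookup order)."""
    if addr in l1:
        return "L1"
    if addr in private_l2:
        return "private L2"
    if addr in shared_l2:
        return "shared L2"
    return "off-chip memory"


# Hypothetical contents for one core: private data stays local, shared data sits in the shared L2.
l1 = {0x10}
private_l2 = {0x20}
shared_l2 = {0x30}

for addr in (0x10, 0x20, 0x30, 0x40):
    print(hex(addr), sps2_lookup(addr, l1, private_l2, shared_l2))
```

Under these assumptions, a block that another processor has just produced can be found in the shared L2 instead of requiring an off-chip access, which is the latency benefit the text describes.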

[[Image:True2.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6]: Sub-blocking in a cache'''

A valid bit is associated with each sub-block in the cache line. A sub-block may be as small as a single word that a processor reads or writes. If a true sharing miss occurs, only the specific word written by the other processor needs to be sent across the bus, not the whole cache line.
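The effect of per-sub-block valid bits on bus traffic can be sketched as below. The class name <tt>SubBlockedLine</tt> and its methods are hypothetical, and the model covers a single cache line in one cache, with a sub-block assumed to be one word.

```python
class SubBlockedLine:
    """One cache line with a valid bit per sub-block (a sub-block is assumed to be one word)."""

    def __init__(self, n_sub_blocks):
        self.valid = [True] * n_sub_blocks

    def remote_write(self, index):
        # Another processor wrote this word: invalidate only that sub-block.
        self.valid[index] = False

    def read(self, index):
        """Return the number of sub-blocks fetched over the bus for this read."""
        if self.valid[index]:
            return 0            # hit: the rest of the line stayed valid
        self.valid[index] = True
        return 1                # miss: only the modified word is transferred


line = SubBlockedLine(8)
line.remote_write(3)
print(line.read(5))   # a different word of the same line
print(line.read(3))   # the word that was actually modified
```

Reading word 5 after the remote write to word 3 transfers nothing, and reading word 3 transfers a single word, whereas a cache with one valid bit per line would miss in both cases and transfer all 8 words each time.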

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:32:30Z

Dtiwari2: /* <tt>'''Techniques to reduce true sharing misses'''</tt> */


CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:32:10Z

Dtiwari2: /* <tt>'''Techniques to reduce true sharing misses'''</tt> */

== <tt>'''Introduction'''</tt> ==

In multiprocessor computing system with snoopy coherence protocol, the overall performance is a contributed by two factors
* Uniprocessor cache miss traffic
* The traffic caused by the communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]).However, the misses arising from interprocessor communication are named as coherence misses and can be divided into two different types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and'' writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because a particular word in the block, other than one being read, is getting written.

The effect of coherence misses become significant for tightly coupled applications that share significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
Id a one processor writes a word in particular block then other processor invalidates shared copy of the same block, subsequently later process intends to read the same word in the same block, then the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different and the invalidation does not cause a new value to be communicated, but only causes an extra cache miss, then it is a false sharing miss. Evidently, if the cache block size is increased, false sharing misses would also go up. Though, large block sizes exploit locality and decrease the effective memory access time, full cache block (all words in the block) is not necessarily useful to a particular processor rather mostly only part of it is relevant. In general, previous works have shown the miss rate of the data cache in multiprocessors changes less predictably than in uniprocessors with increasing cache block size.

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2]([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel application using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation having order <tt>O(n log n) </tt>. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

Not only do false misses increase the memory accesses latency, but also they generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic.Generally, between two consecutive misses on a given block, a processor usually references a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. Following are some of the popular and accepted techniques to combat false sharing issue:

'''I.Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested [http://citeseer.ist.psu.edu/160021.html reference]to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and in the cache block have a greater common divisor (GCD) larger than one.The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.

'''II. Dynamic Cache sub-block Design to Reduce False Sharing ''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference]In this method false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) which are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are dynamically determined with the objective of maximizing the size of a truly shared sub-block which will settle to a value which is most suitable for the application being run and also minimize the invalidation misses due to false partitioning of true shared blocks. Larger blocks exploit locality while coherence is maintained on sub-blocks which minimize bus traffic due to shared misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization to support dynamic cache sub-block design.

[[Image:False03.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] New Architecture for the Dynamic Sub-block Coherence Scheme

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4].

[[Image:False04.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4] The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.

More about benchmarks used :
*[http://www-flash.stanford.edu/apps/SPLASH/ MP3D],
*Floyd’s algorithm is designed to find the least-expensive paths between all the vertices in a graph. It does this by operating on a matrix
representing the costs of edges between vertices. More [http://www-unix.mcs.anl.gov/dbpp/text/node35.html here]
*[http://www-flash.stanford.edu/apps/SPLASH/ Water]
*The Jacobi method is an algorithm in linear algebra for determining the solutions of a system of linear equations with largest absolute values in each row and column dominated by the diagonal element. More [http://en.wikipedia.org/wiki/Jacobi_method here].

== <tt>'''Techniques to reduce true sharing misses'''</tt> ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing, programs with high temporal locality tend to have less true sharing. The amount of true sharing in the program is a critical factor for performance on multiprocessors. High levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

In fact, true misses are required for the correct execution of the program. The techniques used for true sharing is mostly involved with reducing the latency and bus traffic caused by miss.

* One of the proposed technique is to have a private L1 cache and shared L2 cache among all the processors (also called SPS2 system) . [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5] shows one of the possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True1.JPG Figure 5]: A possible SPS2 cache architecture to reduce true sharing latency.'''

Shared L2 is a multi-banked multi-port cache that could be accessed by all the processors directly over the bus. Data in private L1 and L2 are exclusive, but private L1 and shared L2 could be inclusive. Unlike the unified L2 cache structure, the SPS2 system with its split private and shared L2 caches can be flexibly and individually designed according to demand. SPS2 reduces access latency and contention between shared data and private data. It imposes a low L2 hit latency because most of the private data should be found in the local L2. So the shared data will be placed in shared L2 which collectively provide high storage capacity to help reduce off-chip access.

*Sub-blocking in the cache can also help in reducing the bus communication because of true sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6] shows the sub-blocking in a cache.

[[Image:True2.JPG]]
'''[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:True2.JPG Figure 6]: Sub-blocking in a cache'''

A valid bit is associated with each sub-block in the cache line. The sub-block might be equivalent to a word process writes or reads. Now if a true sharing miss occurs then only the specific word written by the processor in the cache line needs to send across the bus (not the whole cache line).

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Languages and Operating Systems, Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors,” in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy, “Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates,” in Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache Sub-block Design to Reduce False Sharing,” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:31:35Z

Dtiwari2: /* <tt>'''Techniques to reduce true sharing misses'''</tt> */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:27:11Z

Dtiwari2: /* <tt>'''Techniques to reduce true sharing misses'''</tt> */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:21:47Z

Dtiwari2: /* <tt>'''Techniques to reduce false sharing misses'''</tt> */

== <tt>'''Introduction'''</tt> ==

In a multiprocessor computing system with a snoopy coherence protocol, overall performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication, in contrast, are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and ''writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated, and a subsequent reference misses, because some word in the block other than the one being read has been written.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, the shared copies of that block held by other processors are invalidated. If another processor subsequently reads the same word in that block, the reference is a true sharing reference and would cause a miss regardless of the block size. If, however, the word being written and the word being read are different, the invalidation does not cause a new value to be communicated but only causes an extra cache miss; this is a false sharing miss. Consequently, as the cache block size increases, false sharing misses tend to increase as well. Although large blocks exploit locality and decrease the effective memory access time, the full cache block (all words in the block) is not necessarily useful to a particular processor; usually only part of it is relevant. In general, previous work has shown that the miss rate of the data cache in multiprocessors changes less predictably than in uniprocessors as the cache block size increases.
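
The distinction can be made concrete with a small classifier. This is a simplified sketch rather than the methodology of the cited studies: it assumes word-aligned addresses, a single remote write between the invalidation and the read, and the hypothetical helper names <tt>block_of</tt> and <tt>classify_coherence_miss</tt>.

```python
def block_of(address, block_size):
    """Index of the cache block that holds a byte address."""
    return address // block_size


def classify_coherence_miss(written_addr, read_addr, block_size):
    """Classify the miss a reader takes after one remote write.

    Returns None when the two addresses fall in different blocks, because
    the write then never invalidated the block being read.
    """
    if block_of(written_addr, block_size) != block_of(read_addr, block_size):
        return None
    # Same block: the miss is "true" only if the reader wants the written word.
    return "true sharing" if written_addr == read_addr else "false sharing"
```

For example, a write to address 0 followed by a read of address 24 causes no coherence miss with 16-byte blocks, but becomes a false sharing miss with 32-byte blocks, whereas a read of the written word itself is a true sharing miss at either block size.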

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2] ([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel applications running on 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation in <tt>O(n log n)</tt> time. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

False sharing misses not only increase memory access latency but also generate traffic between processors and memory. As the block size increases, each miss produces a higher volume of traffic. Yet between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the widely accepted techniques for combating false sharing:

'''I. Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested ([http://citeseer.ist.psu.edu/160021.html reference]) to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them shares the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, the optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than one. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
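
The effect of the Expand Record and Align Record optimizations can be checked with simple address arithmetic. The following is a minimal sketch, not the procedure from the cited paper: sizes are counted in words, the helper names (<tt>blocks_of</tt>, <tt>shared_blocks</tt>, <tt>expand_record</tt>, <tt>avg_blocks_spanned</tt>) are hypothetical, and <tt>expand_record</tt> assumes that a record is padded up to the next multiple of the block size.

```python
from math import gcd


def blocks_of(start, record_words, block_words):
    """Set of block indices touched by a record that starts at word `start`."""
    first = start // block_words
    last = (start + record_words - 1) // block_words
    return set(range(first, last + 1))


def shared_blocks(record_words, block_words, n_records, offset=0):
    """Number of cache blocks that hold words of more than one record."""
    owners = {}
    for rec in range(n_records):
        start = offset + rec * record_words
        for blk in blocks_of(start, record_words, block_words):
            owners[blk] = owners.get(blk, 0) + 1
    return sum(1 for n in owners.values() if n > 1)


def expand_record(record_words, block_words):
    """Expand Record: padded size, rounded up to a whole number of blocks."""
    return -(-record_words // block_words) * block_words


def avg_blocks_spanned(record_words, block_words, n_records, offset=0):
    """Align Record: average number of blocks one record spans."""
    total = sum(
        len(blocks_of(offset + rec * record_words, record_words, block_words))
        for rec in range(n_records)
    )
    return total / n_records


# Expand Record: 3-word records in 4-word blocks, before and after padding.
print(shared_blocks(3, 4, 8), shared_blocks(expand_record(3, 4), 4, 8))

# Align Record: 6-word records in 4-word blocks, GCD = 2.
for offset in (0, 1, gcd(6, 4)):
    print(offset, avg_blocks_spanned(6, 4, 4, offset))
```

In the second experiment, offsets that are 0 or a multiple of the GCD give the smallest average span, which matches the layout rule stated above.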

'''II. Dynamic Cache sub-block Design to Reduce False Sharing''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference] In this method, false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block size is determined dynamically with the objective of maximizing the size of a truly shared sub-block. It settles to the value most suitable for the application being run, which also minimizes the invalidation misses due to false partitioning of truly shared blocks. Larger blocks exploit locality, while coherence is maintained on sub-blocks, which minimizes bus traffic due to sharing misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization to support dynamic cache sub-block design.

[[Image:False03.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] New Architecture for the Dynamic Sub-block Coherence Scheme

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces false sharing misses by 20 to 90 percent compared with the fixed sub-block (FSB) scheme ([http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4]).

[[Image:False04.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4] The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.

More about the benchmarks used:
*[http://www-flash.stanford.edu/apps/SPLASH/ MP3D]
*Floyd’s algorithm finds the least-expensive paths between all pairs of vertices in a graph. It does this by operating on a matrix representing the costs of edges between vertices. More [http://www-unix.mcs.anl.gov/dbpp/text/node35.html here].
*[http://www-flash.stanford.edu/apps/SPLASH/ Water]
*The Jacobi method is an algorithm in linear algebra for determining the solutions of a system of linear equations in which the diagonal element has the largest absolute value in each row and column (a diagonally dominant system). More [http://en.wikipedia.org/wiki/Jacobi_method here].
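
For readers unfamiliar with the Floyd and Jacobi kernels described in the list above, the following sequential sketch shows the computation each one performs. It is an illustration only, not the parallel benchmark code used in the cited study; the function names and the small test inputs are hypothetical.

```python
INF = float("inf")


def floyd(cost):
    """All-pairs least-cost paths on a matrix of edge costs (INF = no edge)."""
    n = len(cost)
    dist = [row[:] for row in cost]          # do not modify the input matrix
    for k in range(n):                       # allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def jacobi(a, b, iterations=50):
    """Jacobi iteration for a x = b, assuming a is diagonally dominant."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # Every new component is computed from the previous iterate only.
        x = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
    return x


print(floyd([[0, 3, INF], [INF, 0, 1], [2, INF, 0]]))
print(jacobi([[4.0, 1.0], [2.0, 5.0]], [9.0, 19.0]))
```

In a parallel version of either kernel, different processors typically update neighbouring matrix or vector elements, which is one way such programs can come to exhibit false sharing.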

== <tt>'''Techniques to reduce true sharing misses'''</tt> ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors: high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

True sharing misses are in fact required for the correct execution of the program, so the techniques aimed at them mainly reduce the latency and bus traffic that each miss incurs. One proposed technique is to have a private L1 cache and a shared L2 cache among all the processors (also called the SPS2 system) [7]. Figure 5 shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

Figure 5: A possible SPS2 cache architecture to reduce true sharing latency.

The shared L2 is a multi-banked, multi-ported cache that can be accessed by all the processors directly over the bus. Data in the private L1 and private L2 are exclusive, but the private L1 and the shared L2 may be inclusive. Unlike a unified L2 cache structure, the SPS2 system, with its split private and shared L2 caches, allows each cache to be designed flexibly and individually according to demand. SPS2 reduces access latency and the contention between shared data and private data. It offers a low L2 hit latency because most private data should be found in the local L2. Shared data are placed in the shared L2, and the L2 caches collectively provide high storage capacity, which helps reduce off-chip accesses.

Sub-blocking in the cache can also help reduce the bus communication caused by true sharing misses. Figure 6 shows sub-blocking in a cache.

[[Image:True2.JPG]]

Figure 6: Sub-blocking in a cache

A valid bit is associated with each sub-block in the cache line. A sub-block might be as small as a single word that a processor reads or writes. If a true sharing miss occurs, only the specific word written by the other processor needs to be sent across the bus, not the whole cache line.
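
The traffic saving from per-sub-block valid bits can be sketched with a toy model of a single cache line. This is a simplified illustration under stated assumptions, not the design in Figure 6: the class <tt>CacheLine</tt>, its methods, and the choice of four sub-blocks per line are hypothetical, and traffic is measured as the number of sub-blocks fetched over the bus.

```python
class CacheLine:
    """One cache line, with either a single valid bit or one per sub-block."""

    def __init__(self, n_sub_blocks, per_sub_block_valid):
        self.n = n_sub_blocks
        self.per_sub_block_valid = per_sub_block_valid
        self.valid = [True] * n_sub_blocks   # line starts fully valid

    def remote_write(self, sub_block):
        """Another processor wrote `sub_block`: invalidate the stale copy."""
        if self.per_sub_block_valid:
            self.valid[sub_block] = False    # only the written sub-block
        else:
            self.valid = [False] * self.n    # the whole line

    def read(self, sub_block):
        """Return the number of sub-blocks fetched over the bus (0 on a hit)."""
        if self.valid[sub_block]:
            return 0
        if self.per_sub_block_valid:
            self.valid[sub_block] = True
            return 1                         # fetch just the missing sub-block
        self.valid = [True] * self.n
        return self.n                        # fetch the whole line


for per_sub_block in (False, True):
    line = CacheLine(4, per_sub_block)
    line.remote_write(2)
    print(per_sub_block, line.read(2))
```

With one valid bit per sub-block, a read of the sub-block that was written remotely fetches a single sub-block instead of the whole line, and a read of any other sub-block in the line still hits.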

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors,” in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy, “Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates,” in Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pp. 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing,” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, and Shaowen Qin, “Split Private and Shared L2 Cache Architecture for Snooping-based CMP.”

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:20:15Z

Dtiwari2: /* <tt>'''Techniques to reduce false sharing misses'''</tt> */


CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:16:08Z

Dtiwari2: /* <tt>'''Effect of false sharing'''</tt> */


CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:15:21Z

Dtiwari2: /* Techniques to reduce true sharing misses */

== <tt>'''Introduction'''</tt> ==

In multiprocessor computing system with snoopy coherence protocol, the overall performance is a contributed by two factors
* Uniprocessor cache miss traffic
* The traffic caused by the communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]).However, the misses arising from interprocessor communication are named as coherence misses and can be divided into two different types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and'' writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference to it misses even though the referenced word was not modified: another processor wrote a different word in the same block.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, the shared copies of that block held by other processors are invalidated. If one of those processors later reads the same word in the same block, the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different, the invalidation does not cause a new value to be communicated but only causes an extra cache miss; this is a false sharing miss. Consequently, as the cache block size increases, false sharing misses tend to increase as well. Although large blocks exploit locality and decrease the effective memory access time, the full cache block (all words in the block) is not necessarily useful to a particular processor; usually only part of it is relevant. In general, [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf previous work] has shown that the miss rate of the data cache changes less predictably with increasing cache block size in multiprocessors than in uniprocessors.
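
The distinction between the two kinds of coherence miss can be illustrated with a small simulation. The following Python sketch is an illustration only and is not taken from the cited studies. It assumes infinite caches (so only cold and coherence misses occur), a write-invalidate protocol with one valid bit per block, and a simplified classification rule: a coherence miss counts as a true sharing miss only if the referenced word was written by another processor since the block was lost. The name <tt>classify_misses</tt> and the trace format are hypothetical.

```python
from collections import defaultdict

def classify_misses(trace, block_words):
    """Classify misses for a trace of (processor, 'R' or 'W', word address)."""
    holders = defaultdict(set)        # block -> processors holding a valid copy
    seen_by = defaultdict(set)        # block -> processors that ever cached it
    remote_writes = defaultdict(set)  # (proc, block) -> words others wrote since proc last fetched
    counts = {"cold": 0, "true": 0, "false": 0}
    for proc, op, word in trace:
        block = word // block_words
        if proc not in holders[block]:           # miss
            if proc not in seen_by[block]:
                counts["cold"] += 1              # first reference to the block
            elif word in remote_writes[(proc, block)]:
                counts["true"] += 1              # a new value is communicated
            else:
                counts["false"] += 1             # only a neighbouring word changed
            holders[block].add(proc)
            seen_by[block].add(proc)
            remote_writes[(proc, block)].clear()
        if op == "W":                            # invalidate all other copies
            for other in seen_by[block] - {proc}:
                remote_writes[(other, block)].add(word)
            holders[block] = {proc}
    return counts

# P0 keeps writing word 0 while P1 keeps reading word 1.
false_trace = [(0, "W", 0), (1, "R", 1)] * 4
print(classify_misses(false_trace, block_words=1))  # each word in its own block
print(classify_misses(false_trace, block_words=2))  # both words in one block
```

With one word per block this trace causes only the two cold misses; with two words per block the same trace adds three false sharing misses. A trace in which P1 reads the word that P0 writes produces the same number of true sharing misses for every block size.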

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2] ([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel applications using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation in <tt>O(n log n)</tt> time. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

False sharing misses not only increase memory access latency but also generate traffic between processors and memory. As the block size increases, each miss produces a higher volume of traffic. Yet between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the widely accepted techniques:

'''I. Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested ([http://citeseer.ist.psu.edu/160021.html reference]) to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than one. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
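
The effect of Expand Record and Align Record can be checked with simple word-address arithmetic. The Python sketch below is a minimal illustration under stated assumptions, not the procedure of the cited paper: sizes are counted in words, Expand Record is modelled as padding each record to a whole number of blocks, and Align Record is modelled as a search over all starting offsets within one block. All function names are hypothetical.

```python
from math import gcd

def expand_record(record_words, block_words):
    """Expand Record: pad a record with dummy words up to a whole number of blocks."""
    return -(-record_words // block_words) * block_words  # ceiling division

def shared_blocks(record_words, block_words, n_records, offset=0):
    """Count cache blocks that hold words of more than one record."""
    owners = {}
    for r in range(n_records):
        start = offset + r * record_words
        for w in range(start, start + record_words):
            owners.setdefault(w // block_words, set()).add(r)
    return sum(1 for recs in owners.values() if len(recs) > 1)

def avg_blocks_spanned(record_words, block_words, offset):
    """Average number of blocks a record spans over one full layout period."""
    period = block_words // gcd(record_words, block_words)  # records per period
    total = 0
    for r in range(period):
        first = offset + r * record_words
        last = first + record_words - 1
        total += last // block_words - first // block_words + 1
    return total / period

def align_record(record_words, block_words):
    """Align Record: smallest starting offset with the lowest average span."""
    return min(range(block_words),
               key=lambda off: avg_blocks_spanned(record_words, block_words, off))

# Four 3-word records in 4-word blocks, before and after padding.
print(shared_blocks(3, 4, 4))
print(shared_blocks(expand_record(3, 4), 4, 4))
# 6-word records in 4-word blocks (GCD 2): compare two starting offsets.
print(avg_blocks_spanned(6, 4, 0), avg_blocks_spanned(6, 4, 1))
print(align_record(6, 4))
```

In the 6-word example the offsets that minimize the average span are 0 and 2, both multiples of the GCD, which is consistent with the rule stated above; the search returns the smaller one because it wastes less space.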

'''II. Dynamic Cache sub-block Design to Reduce False Sharing:''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference] In this method, false sharing is minimized by dynamically locating the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are determined dynamically with the objective of maximizing the size of a truly shared sub-block. The size settles to the value most suitable for the application being run, which also minimizes the invalidation misses caused by falsely partitioning truly shared blocks. Larger blocks exploit locality, while maintaining coherence on sub-blocks minimizes bus traffic due to shared misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization that supports the dynamic cache sub-block design.

[[Image:False03.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] New Architecture for the Dynamic Sub-block Coherence Scheme

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme (see [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4]).

[[Image:False04.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4] The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.

== <tt>'''Techniques to reduce true sharing misses'''</tt> ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors. High levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

True sharing misses are in fact required for the correct execution of the program. The techniques that target true sharing therefore mainly aim to reduce the latency and bus traffic caused by each miss. One proposed technique is to have a private L1 cache and a shared L2 cache among all the processors (also called an SPS2 system) [7]. Figure 5 shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

Figure 5: A possible SPS2 cache architecture to reduce true sharing latency.

The shared L2 is a multi-banked, multi-port cache that can be accessed by all the processors directly over the bus. Data in private L1 and L2 are exclusive, but private L1 and shared L2 could be inclusive. Unlike a unified L2 cache structure, the SPS2 system with its split private and shared L2 caches can be flexibly and individually designed according to demand. SPS2 reduces access latency and contention between shared data and private data. It offers a low L2 hit latency because most of the private data should be found in the local L2. Shared data is placed in the shared L2 banks, which collectively provide high storage capacity to help reduce off-chip accesses.

Sub-blocking in the cache can also help reduce the bus communication caused by true sharing misses. Figure 6 shows sub-blocking in a cache.

[[Image:True2.JPG]]

Figure 6: Sub-blocking in a cache

A valid bit is associated with each sub-block in the cache line. A sub-block might be as small as a single word that a processor reads or writes. If a true sharing miss occurs, only the specific word written by the writing processor needs to be sent across the bus, not the whole cache line.
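
The behaviour of per-sub-block valid bits can be sketched in a few lines of Python. This is a simplified illustration, not the hardware design of [5] or [7]: it assumes one word per sub-block, models memory as a plain list, and counts bus traffic in words. The class <tt>SubBlockedLine</tt> and its method names are hypothetical.

```python
class SubBlockedLine:
    """One cache line with a valid bit per sub-block (one word per sub-block)."""

    def __init__(self, n_subblocks):
        self.valid = [False] * n_subblocks
        self.data = [None] * n_subblocks

    def fill(self, words):
        """Load the whole line, as on a cold miss."""
        self.data = list(words)
        self.valid = [True] * len(self.data)

    def snoop_write(self, index):
        """Another processor wrote sub-block `index`: invalidate only that one."""
        self.valid[index] = False

    def read(self, index, memory):
        """Return (value, number of words transferred over the bus)."""
        if self.valid[index]:
            return self.data[index], 0       # hit, no bus traffic
        self.data[index] = memory[index]     # fetch only the stale sub-block
        self.valid[index] = True
        return self.data[index], 1

memory = [10, 11, 12, 13]
line = SubBlockedLine(4)
line.fill(memory)

memory[2] = 99        # a remote processor writes word 2
line.snoop_write(2)

print(line.read(1, memory))  # untouched word: still a hit
print(line.read(2, memory))  # modified word: true sharing miss, one word moved
```

The read of word 1 would have been a false sharing miss with a single valid bit per line, and the read of word 2 transfers one word instead of the whole four-word line.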

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:13:22Z

Dtiwari2: /* <tt>'''Techniques to reduce false sharing misses'''</tt> */

== <tt>'''Introduction'''</tt> ==

In a multiprocessor computing system with a snoopy coherence protocol, overall performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'': capacity, compulsory, and conflict misses ([http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and'' writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because a particular word in the block, other than one being read, is getting written.

The effect of coherence misses become significant for tightly coupled applications that share significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, the shared copies of that block held by other processors are invalidated. If one of those processors later reads the same word in the same block, the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different, the invalidation does not cause a new value to be communicated but only causes an extra cache miss; this is a false sharing miss. Consequently, as the cache block size increases, false sharing misses tend to increase as well. Although large blocks exploit locality and decrease the effective memory access time, the full cache block (all words in the block) is not necessarily useful to a particular processor; usually only part of it is relevant. In general, [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf previous work] has shown that the miss rate of the data cache changes less predictably with increasing cache block size in multiprocessors than in uniprocessors.

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2]([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel application using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation having order <tt>O(n log n) </tt>. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

False sharing misses not only increase memory access latency but also generate traffic between processors and memory. As the block size increases, each miss produces a higher volume of traffic. Yet between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. Following are some of the popular and accepted techniques to combat false sharing issue:

'''I.Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested [http://citeseer.ist.psu.edu/160021.html reference]to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than one. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.

'''II. Dynamic Cache sub-block Design to Reduce False Sharing ''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference]In this method false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) which are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are dynamically determined with the objective of maximizing the size of a truly shared sub-block which will settle to a value which is most suitable for the application being run and also minimize the invalidation misses due to false partitioning of true shared blocks. Larger blocks exploit locality while coherence is maintained on sub-blocks which minimize bus traffic due to shared misses. [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] shows the cache organization to support dynamic cache sub-block design.

[[Image:False03.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False03.JPG Figure 3] New Architecture for the Dynamic Sub-block Coherence Scheme

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4].

[[Image:False04.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False04.JPG Figure 4] The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.

== Techniques to reduce true sharing misses ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing, programs with high temporal locality tend to have less true sharing. The amount of true sharing in the program is a critical factor for performance on multiprocessors. High levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

True sharing misses are in fact required for the correct execution of the program. The techniques that target true sharing therefore mainly aim to reduce the latency and bus traffic caused by each miss. One proposed technique is to have a private L1 cache and a shared L2 cache among all the processors (also called an SPS2 system) [7]. Figure 5 shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

Figure 5: A possible SPS2 cache architecture to reduce true sharing latency.

Shared L2 is a multi-banked multi-port cache that could be accessed by all the processors directly over the bus. Data in private L1 and L2 are exclusive, but private L1 and shared L2 could be inclusive. Unlike the unified L2 cache structure, the SPS2 system with its split private and shared L2 caches can be flexibly and individually designed according to demand. SPS2 reduces access latency and contention between shared data and private data. It imposes a low L2 hit latency because most of the private data should be found in the local L2. So the shared data will be placed in shared L2 which collectively provide high storage capacity to help reduce off-chip access.

Sub-blocking in the cache can also help in reducing the bus communication because of true sharing misses. Figure 6 shows the sub-blocking in a cache.

[[Image:True2.JPG]]

Figure 6: Sub-blocking in a cache

A valid bit is associated with each sub-block in the cache line. A sub-block might be as small as a single word that a processor reads or writes. If a true sharing miss occurs, only the specific word written by the writing processor needs to be sent across the bus, not the whole cache line.

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:11:07Z

Dtiwari2: /* <tt>'''Techniques to reduce false sharing misses'''</tt> */

== <tt>'''Introduction'''</tt> ==

In a multiprocessor computing system with a snoopy coherence protocol, overall performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'': capacity, compulsory, and conflict misses ([http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and'' writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because a particular word in the block, other than one being read, is getting written.

The effect of coherence misses become significant for tightly coupled applications that share significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, the shared copies of that block held by other processors are invalidated. If one of those processors later reads the same word in the same block, the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different, the invalidation does not cause a new value to be communicated but only causes an extra cache miss; this is a false sharing miss. Consequently, as the cache block size increases, false sharing misses tend to increase as well. Although large blocks exploit locality and decrease the effective memory access time, the full cache block (all words in the block) is not necessarily useful to a particular processor; usually only part of it is relevant. In general, [http://parasol.tamu.edu/~rwerger/Courses/689/spring2002/day-3-ParMemAlloc/papers/lee96effective.pdf previous work] has shown that the miss rate of the data cache changes less predictably with increasing cache block size in multiprocessors than in uniprocessors.

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2]([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel application using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation having order <tt>O(n log n) </tt>. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

False sharing misses not only increase memory access latency but also generate traffic between processors and memory. As the block size increases, each miss produces a higher volume of traffic. Yet between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. Following are some of the popular and accepted techniques to combat false sharing issue:

'''I.Data placement optimizations:''' [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 reference] This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested [http://citeseer.ist.psu.edu/160021.html reference]to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than one. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.

'''II. Dynamic Cache sub-block Design to Reduce False Sharing ''' [http://citeseer.ist.psu.edu/kadiyala95dynamic.html reference]In this method false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) which are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are dynamically determined with the objective of maximizing the size of a truly shared sub-block which will settle to a value which is most suitable for the application being run and also minimize the invalidation misses due to false partitioning of true shared blocks. Larger blocks exploit locality while coherence is maintained on sub-blocks which minimize bus traffic due to shared misses. Figure 3 shows the cache organization to support dynamic cache sub-block design.

[[Image:False03.JPG]]

Figure 3: New Architecture for the Dynamic Sub-block Coherence Scheme

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme (Figure 4).

[[Image:False04.JPG]]

Figure 4: The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.

== Techniques to reduce true sharing misses ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing, programs with high temporal locality tend to have less true sharing. The amount of true sharing in the program is a critical factor for performance on multiprocessors. High levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

True sharing misses are in fact required for the correct execution of the program. The techniques that target true sharing therefore mainly aim to reduce the latency and bus traffic caused by each miss. One proposed technique is to have a private L1 cache and a shared L2 cache among all the processors (also called an SPS2 system) [7]. Figure 5 shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

Figure 5: A possible SPS2 cache architecture to reduce true sharing latency.

Shared L2 is a multi-banked multi-port cache that could be accessed by all the processors directly over the bus. Data in private L1 and L2 are exclusive, but private L1 and shared L2 could be inclusive. Unlike the unified L2 cache structure, the SPS2 system with its split private and shared L2 caches can be flexibly and individually designed according to demand. SPS2 reduces access latency and contention between shared data and private data. It imposes a low L2 hit latency because most of the private data should be found in the local L2. So the shared data will be placed in shared L2 which collectively provide high storage capacity to help reduce off-chip access.

Sub-blocking in the cache can also help reduce the bus communication caused by true sharing misses. Figure 6 shows sub-blocking in a cache.

[[Image:True2.JPG]]

Figure 6: Sub-blocking in a cache

A valid bit is associated with each sub-block in the cache line. A sub-block might be the size of a word that a processor reads or writes. If a true sharing miss occurs, only the specific word written by the processor needs to be sent across the bus, not the whole cache line.
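As a rough illustration of this idea, the following Python sketch models a single cache line with one valid bit per sub-block. The class name <tt>SubBlockedLine</tt>, the sizes, and the byte counting are assumptions made for this example; they are not taken from any of the cited designs.

```python
class SubBlockedLine:
    """One cache line split into sub-blocks, each with its own valid bit."""

    def __init__(self, num_subblocks=8, subblock_bytes=4):
        self.subblock_bytes = subblock_bytes
        self.valid = [True] * num_subblocks  # assume the line starts fully valid

    def remote_write(self, index):
        """Another processor wrote sub-block `index`: invalidate only that one."""
        self.valid[index] = False

    def read(self, index):
        """Return the number of bytes fetched over the bus for this read."""
        if self.valid[index]:
            return 0  # hit: the sub-block is still valid locally
        self.valid[index] = True  # miss: refetch just this sub-block
        return self.subblock_bytes


line = SubBlockedLine()
line.remote_write(3)
bytes_other_word = line.read(5)  # different word: still a hit
bytes_same_word = line.read(3)   # true sharing miss: one sub-block is sent
whole_line_bytes = len(line.valid) * line.subblock_bytes
print(bytes_other_word, bytes_same_word, whole_line_bytes)
```

In this toy model the read of the modified word moves one sub-block over the bus, while a cache without sub-blocking would move the whole line.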

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:09:24Z

Dtiwari2: /* <tt>'''Techniques to reduce false sharing misses'''</tt> */

== <tt>'''Introduction'''</tt> ==

In a multiprocessor computing system with a snoopy coherence protocol, overall performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses; see [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication, however, are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block, and the processor then ''writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because some word in the block, other than the one being read, has been written.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, the shared copy of that block in another processor's cache is invalidated. If the second processor subsequently reads the same word in the same block, the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different, and the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. Evidently, if the cache block size is increased, false sharing misses would also go up. Although large block sizes exploit locality and decrease the effective memory access time, the full cache block (all words in the block) is not necessarily useful to a particular processor; usually only part of it is relevant. In general, previous works have shown that the miss rate of the data cache in multiprocessors changes less predictably than in uniprocessors with increasing cache block size.
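The distinction can be made concrete with a small Python sketch. It assumes that word addresses are plain integers, that the reading processor already held a copy of the block, and that a single remote write precedes the read; the function name <tt>classify_miss</tt> is made up for this illustration.

```python
def classify_miss(written_word, read_word, block_words):
    """Classify a read that follows a remote write.

    Word addresses are plain integers; a block holds `block_words` words.
    """
    same_block = written_word // block_words == read_word // block_words
    if not same_block:
        return "no coherence miss"   # the write invalidated a different block
    if written_word == read_word:
        return "true sharing miss"   # the new value must be communicated
    return "false sharing miss"      # only the block, not the word, is shared


# The same pair of accesses, evaluated for growing block sizes.
for block_words in (2, 4, 8):
    print(block_words, classify_miss(written_word=1, read_word=6, block_words=block_words))
```

With small blocks the two words fall into different blocks and the write does not disturb the reader. Once the block is large enough to hold both words, the same access pattern turns into a false sharing miss, while a read of the written word itself is a true sharing miss for every block size.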

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2] ([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel applications using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation with complexity <tt>O(n log n)</tt>. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

False sharing misses not only increase memory access latency but also generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic. Between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the widely accepted techniques to combat false sharing:

'''I. Data placement optimizations:''' ([http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 source]) This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested ([http://citeseer.ist.psu.edu/160021.html source]) to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, optimization is not applied.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than one. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
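To make the Expand Record idea concrete, here is a minimal Python sketch that counts how many cache blocks hold data from more than one record, with and without padding. The function name <tt>shared_blocks</tt> and the sizes are assumptions chosen for illustration, and records are assumed to be laid out back to back starting on a block boundary.

```python
from collections import defaultdict


def shared_blocks(num_records, record_words, block_words, pad_words=0):
    """Count cache blocks that hold words from more than one record.

    Records are laid out back to back starting at word 0; `pad_words`
    dummy words are appended to each record (the Expand Record idea).
    """
    stride = record_words + pad_words
    owners = defaultdict(set)  # block number -> records with real data in it
    for rec in range(num_records):
        start = rec * stride
        for word in range(start, start + record_words):
            owners[word // block_words].add(rec)
    return sum(1 for recs in owners.values() if len(recs) > 1)


unpadded = shared_blocks(8, record_words=3, block_words=4)
padded = shared_blocks(8, record_words=3, block_words=4, pad_words=1)
print(unpadded, padded)
```

In this example every block of the unpadded array is shared by two records, so writes by different processors to neighbouring records can falsely share a block. One dummy word per record removes the sharing at the cost of a third more memory, which is the trade-off against prefetching gain described above.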

'''II. Dynamic Cache sub-block Design to Reduce False Sharing:''' In this method, false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are determined dynamically with the objective of maximizing the size of a truly shared sub-block. The size settles to a value that is most suitable for the application being run, which also minimizes the invalidation misses due to false partitioning of truly shared blocks. Larger blocks exploit locality, while coherence is maintained on sub-blocks, which minimizes bus traffic due to shared misses. Figure 3 shows the cache organization that supports the dynamic cache sub-block design.

[[Image:False03.JPG]]

Figure 3: New Architecture for the Dynamic Sub-block Coherence Scheme

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme (Figure 4).

[[Image:False04.JPG]]

Figure 4: The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.

== Techniques to reduce true sharing misses ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors. High levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

In fact, true sharing misses are required for the correct execution of the program. The techniques used for true sharing are therefore mainly concerned with reducing the latency and bus traffic caused by the miss. One proposed technique is to have a private L1 cache and a shared L2 cache among all the processors (also called an SPS2 system) [7]. Figure 5 shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

Figure 5: A possible SPS2 cache architecture to reduce true sharing latency.

The shared L2 is a multi-banked, multi-port cache that can be accessed by all the processors directly over the bus. Data in private L1 and L2 are exclusive, but private L1 and shared L2 can be inclusive. Unlike a unified L2 cache structure, the SPS2 system with its split private and shared L2 caches can be flexibly and individually designed according to demand. SPS2 reduces access latency and contention between shared data and private data. It imposes a low L2 hit latency because most of the private data should be found in the local L2. Shared data is placed in the shared L2, which provides high storage capacity to help reduce off-chip accesses.

Sub-blocking in the cache can also help reduce the bus communication caused by true sharing misses. Figure 6 shows sub-blocking in a cache.

[[Image:True2.JPG]]

Figure 6: Sub-blocking in a cache

A valid bit is associated with each sub-block in the cache line. A sub-block might be the size of a word that a processor reads or writes. If a true sharing miss occurs, only the specific word written by the processor needs to be sent across the bus, not the whole cache line.

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:05:53Z

Dtiwari2: /* <tt>'''Techniques to reduce false sharing misses'''</tt> */

== <tt>'''Introduction'''</tt> ==

In a multiprocessor computing system with a snoopy coherence protocol, overall performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses; see [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication, however, are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block, and the processor then ''writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because some word in the block, other than the one being read, has been written.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, the shared copy of that block in another processor's cache is invalidated. If the second processor subsequently reads the same word in the same block, the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different, and the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. Evidently, if the cache block size is increased, false sharing misses would also go up. Although large block sizes exploit locality and decrease the effective memory access time, the full cache block (all words in the block) is not necessarily useful to a particular processor; usually only part of it is relevant. In general, previous works have shown that the miss rate of the data cache in multiprocessors changes less predictably than in uniprocessors with increasing cache block size.

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2] ([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel applications using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation with complexity <tt>O(n log n)</tt>. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

False sharing misses not only increase memory access latency but also generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic. Between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the widely accepted techniques to combat false sharing:

'''I. Data placement optimizations''' ([http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 source], [http://citeseer.ist.psu.edu/160021.html source]): This method addresses the problem of reducing false sharing misses on shared data by improving the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.

Four optimizations of the data layout have been suggested ([http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=878285 source], [http://citeseer.ist.psu.edu/160021.html source]) to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, do not apply the optimization.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than 1. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
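The role of the GCD in Align Record can be checked numerically. The Python sketch below is one possible reading of the rule, not the procedure from the cited paper: it computes the average number of blocks a record spans for every starting offset and reports the offsets that minimize it. The function names and the example sizes are assumptions.

```python
from math import gcd


def average_blocks_spanned(record_words, block_words, offset):
    """Average number of cache blocks touched by one record of the array.

    The array starts `offset` words after a block boundary. The layout
    repeats every lcm(record, block) words, so one period is enough.
    """
    period = block_words // gcd(record_words, block_words)  # records per period
    total = 0
    for rec in range(period):
        first = offset + rec * record_words
        last = first + record_words - 1
        total += last // block_words - first // block_words + 1
    return total / period


def best_offsets(record_words, block_words):
    """Offsets (in words) that minimize the average number of blocks spanned."""
    costs = {o: average_blocks_spanned(record_words, block_words, o)
             for o in range(block_words)}
    best = min(costs.values())
    return [o for o, c in costs.items() if c == best]


print(gcd(6, 8), best_offsets(6, 8))
```

For 6-word records and 8-word blocks the GCD is 2, and the minimizing offsets in this model are exactly 0 and the multiples of 2, so the one that wastes the least space can be chosen freely among them. When the GCD is 1 (for example 3-word records and 4-word blocks) every offset gives the same average, which matches the condition stated above.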

'''II. Dynamic Cache sub-block Design to Reduce False Sharing:''' In this method, false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are determined dynamically with the objective of maximizing the size of a truly shared sub-block. The size settles to a value that is most suitable for the application being run, which also minimizes the invalidation misses due to false partitioning of truly shared blocks. Larger blocks exploit locality, while coherence is maintained on sub-blocks, which minimizes bus traffic due to shared misses. Figure 3 shows the cache organization that supports the dynamic cache sub-block design.

[[Image:False03.JPG]]

Figure 3: New Architecture for the Dynamic Sub-block Coherence Scheme

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme (Figure 4).

[[Image:False04.JPG]]

Figure 4: The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.

== Techniques to reduce true sharing misses ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors. High levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

In fact, true sharing misses are required for the correct execution of the program. The techniques used for true sharing are therefore mainly concerned with reducing the latency and bus traffic caused by the miss. One proposed technique is to have a private L1 cache and a shared L2 cache among all the processors (also called an SPS2 system) [7]. Figure 5 shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

Figure 5: A possible SPS2 cache architecture to reduce true sharing latency.

The shared L2 is a multi-banked, multi-port cache that can be accessed by all the processors directly over the bus. Data in private L1 and L2 are exclusive, but private L1 and shared L2 can be inclusive. Unlike a unified L2 cache structure, the SPS2 system with its split private and shared L2 caches can be flexibly and individually designed according to demand. SPS2 reduces access latency and contention between shared data and private data. It imposes a low L2 hit latency because most of the private data should be found in the local L2. Shared data is placed in the shared L2, which provides high storage capacity to help reduce off-chip accesses.

Sub-blocking in the cache can also help reduce the bus communication caused by true sharing misses. Figure 6 shows sub-blocking in a cache.

[[Image:True2.JPG]]

Figure 6: Sub-blocking in a cache

A valid bit is associated with each sub-block in the cache line. A sub-block might be the size of a word that a processor reads or writes. If a true sharing miss occurs, only the specific word written by the processor needs to be sent across the bus, not the whole cache line.

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T03:01:21Z

Dtiwari2: /* <tt>'''Techniques to reduce false sharing misses'''</tt> */

== <tt>'''Introduction'''</tt> ==

In a multiprocessor computing system with a snoopy coherence protocol, overall performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses; see [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication, however, are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block, and the processor then ''writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because some word in the block, other than the one being read, has been written.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
If one processor writes a word in a particular block, the shared copy of that block in another processor's cache is invalidated. If the second processor subsequently reads the same word in the same block, the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different, and the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. Evidently, if the cache block size is increased, false sharing misses would also go up. Although large block sizes exploit locality and decrease the effective memory access time, the full cache block (all words in the block) is not necessarily useful to a particular processor; usually only part of it is relevant. In general, previous works have shown that the miss rate of the data cache in multiprocessors changes less predictably than in uniprocessors with increasing cache block size.

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2] ([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel applications using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation with complexity <tt>O(n log n)</tt>. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

False sharing misses not only increase memory access latency but also generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic. Between two consecutive misses on a given block, a processor usually references only a very small number of distinct words in that block.

== <tt>'''Techniques to reduce false sharing misses'''</tt> ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the widely accepted techniques to combat false sharing:

'''I. Data placement optimizations:''' This method addresses the problem of reducing false sharing misses on shared data by enhancing the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.
Four optimizations of the data layout have been suggested in [4] to reduce false sharing cache misses in a multiprocessor environment:

* '''Split Scalar''': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

* '''Heap Allocate''': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record''': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, do not apply the optimization.

* '''Align Record''': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than 1. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
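The Heap Allocate optimization from the list above can be sketched as a simple bump allocator with one region per processor. This is only an illustration under assumed parameters: the class <tt>PerProcessorHeap</tt>, the region size, and the block size are hypothetical and do not describe the allocator used in [4].

```python
class PerProcessorHeap:
    """Bump allocator that gives each processor its own block-aligned region."""

    def __init__(self, num_procs, region_bytes, block_bytes):
        assert region_bytes % block_bytes == 0, "regions must be block aligned"
        self.block_bytes = block_bytes
        self.region_bytes = region_bytes
        # Next free address for each processor, starting at its region base.
        self.next_free = [p * region_bytes for p in range(num_procs)]

    def alloc(self, proc, size):
        """Return the start address of `size` bytes from `proc`'s region."""
        addr = self.next_free[proc]
        limit = (proc + 1) * self.region_bytes
        if addr + size > limit:
            raise MemoryError("region exhausted")
        self.next_free[proc] = addr + size
        return addr

    def blocks(self, addr, size):
        """Set of cache block numbers covered by an allocation."""
        first = addr // self.block_bytes
        last = (addr + size - 1) // self.block_bytes
        return set(range(first, last + 1))


heap = PerProcessorHeap(num_procs=2, region_bytes=1024, block_bytes=64)
a = heap.alloc(proc=0, size=24)
b = heap.alloc(proc=1, size=24)
c = heap.alloc(proc=0, size=24)
print(heap.blocks(a, 24) & heap.blocks(b, 24))  # blocks shared across processors
print(heap.blocks(a, 24) & heap.blocks(c, 24))  # blocks shared within processor 0
```

Because every region starts on a block boundary, allocations requested by different processors never land in the same cache block, while small allocations by the same processor can still be packed together.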

'''II. Dynamic Cache sub-block Design to Reduce False Sharing:''' In this method, false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block sizes are determined dynamically with the objective of maximizing the size of a truly shared sub-block. The size settles to a value that is most suitable for the application being run, which also minimizes the invalidation misses due to false partitioning of truly shared blocks. Larger blocks exploit locality, while coherence is maintained on sub-blocks, which minimizes bus traffic due to shared misses. Figure 3 shows the cache organization that supports the dynamic cache sub-block design.

[[Image:False03.JPG]]

Figure 3: New Architecture for the Dynamic Sub-block Coherence Scheme

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme (Figure 4).

[[Image:False04.JPG]]

Figure 4: The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.

== Techniques to reduce true sharing misses ==

True sharing requires the processors involved to synchronize explicitly with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors: high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

True sharing misses are in fact required for the correct execution of the program, so the techniques that target them mainly aim to reduce the latency and bus traffic caused by each miss. One proposed technique is to give each processor a private L1 cache and to share an L2 cache among all the processors (also called the SPS2 system) [7]. Figure 5 shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG]]

Figure 5: A possible SPS2 cache architecture to reduce true sharing latency.

The shared L2 is a multi-banked, multi-ported cache that all the processors can access directly over the bus. Data in private L1 and L2 are exclusive, but private L1 and shared L2 could be inclusive. Unlike a unified L2 cache structure, the SPS2 system with its split private and shared L2 caches can be designed flexibly and individually according to demand. SPS2 reduces access latency and the contention between shared data and private data. It offers a low L2 hit latency because most private data should be found in the local L2, while shared data is placed in the shared L2 banks, which collectively provide high storage capacity and help reduce off-chip accesses.

Sub-blocking in the cache can also help reduce the bus communication caused by true sharing misses. Figure 6 shows sub-blocking in a cache.

[[Image:True2.JPG]]

Figure 6: Sub-blocking in a cache

A valid bit is associated with each sub-block in the cache line. A sub-block might be as small as the word that a processor reads or writes. When a true sharing miss occurs, only the specific word written by the other processor needs to be sent across the bus, not the whole cache line.

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Languages and Operating Systems, Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors,” in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy, “Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates,” in Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pp. 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing,” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, and Shaowen Qin, “Split Private and Shared L2 Cache Architecture for Snooping-based CMP.”

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T02:55:14Z

Dtiwari2: /* <tt>'''Techniques to reduce false sharing misses'''</tt> */


CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T02:53:01Z

Dtiwari2: /* '''<tt>Techniques to reduce false sharing misses</tt>''' */


CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T02:20:16Z

Dtiwari2: /* <tt>Techniques to reduce false sharing misses</tt> */

== <tt>'''Introduction'''</tt> ==

In multiprocessor computing system with snoopy coherence protocol, the overall performance is a contributed by two factors
* Uniprocessor cache miss traffic
* The traffic caused by the communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]).However, the misses arising from interprocessor communication are named as coherence misses and can be divided into two different types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and'' writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because a particular word in the block, other than one being read, is getting written.

The effect of coherence misses become significant for tightly coupled applications that share significant amount of data, ''for example commercial workloads''.

== <tt>'''Effect of false sharing'''</tt> ==
Id a one processor writes a word in particular block then other processor invalidates shared copy of the same block, subsequently later process intends to read the same word in the same block, then the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different and the invalidation does not cause a new value to be communicated, but only causes an extra cache miss, then it is a false sharing miss. Evidently, if the cache block size is increased, false sharing misses would also go up. Though, large block sizes exploit locality and decrease the effective memory access time, full cache block (all words in the block) is not necessarily useful to a particular processor rather mostly only part of it is relevant. In general, ['''fix--this'''previous works] have shown the miss rate of the data cache in multiprocessors changes less predictably than in uniprocessors with increasing cache block size.

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1] and [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2]([http://ieeexplore.ieee.org/iel3/4266/12232/00562898.pdf?arnumber=562898 source]) depict the impact of false sharing for different parallel application using 32 processors:
[[Image:False01.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False01.JPG Figure 1:] The number of false sharing misses and true sharing misses during execution of '''parallel application Barnes'''.
The Barnes-Hut simulation is an algorithm for performing an N-body simulation having order <tt>O(n log n) </tt>. More information [http://en.wikipedia.org/wiki/Barnes-Hut_simulation here].

[[Image:False02.JPG]]

[http://pg.ece.ncsu.edu/mediawiki/index.php/Image:False02.JPG Figure 2:] The number of false sharing misses and true sharing misses during execution of '''parallel application Cholesky'''. More about Cholesky decomposition [http://en.wikipedia.org/wiki/Cholesky_decomposition here].

Not only do false misses increase the memory accesses latency, but also they generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic.Generally, between two consecutive misses on a given block, a processor usually references a very small number of distinct words in that block.

== '''<tt>Techniques to reduce false sharing misses</tt>''' ==

Various hardware and software techniques have been proposed to reduce false sharing misses. Following are some of the popular and accepted techniques to combat false sharing issue:

'''1. Data placement optimizations:''' This method addresses the problem of reducing false sharing misses on shared data by enhancing the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.
Four optimizations of the data layout have been suggested in [4] to reduce false sharing cache misses in a multiprocessor environment:

[a] ''SplitScalar'': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

[b] ''HeapAllocate'': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

[c] ''Expand Record'': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, do not apply the optimization.

[d] ''Align Record'': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. It is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than 1. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
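
The effect of the ''Expand Record'' and ''Align Record'' layouts can be checked with a little arithmetic. The sketch below is not the tool used in [4]; it only counts, for an array of equally sized records, how many cache blocks each record spans and how many blocks hold words of more than one record. All function names are hypothetical, and rounding the record size up to a whole number of blocks is just one possible padding policy, assumed here for illustration.

```python
from math import gcd


def blocks_of(record, record_words, block_words, offset=0):
    """Range of block indices touched by one record of the array."""
    first = offset + record * record_words
    last = first + record_words - 1
    return range(first // block_words, last // block_words + 1)


def average_span(record_words, block_words, offset, n_records):
    """Average number of cache blocks spanned by a record (Align Record metric)."""
    total = sum(len(blocks_of(r, record_words, block_words, offset))
                for r in range(n_records))
    return total / n_records


def candidate_offsets(record_words, block_words):
    """Offsets from a block boundary that are 0 or a multiple of the GCD."""
    return range(0, block_words, gcd(record_words, block_words))


def shared_blocks(record_words, block_words, n_records, offset=0):
    """Number of blocks that contain words from more than one record."""
    owners = {}
    for r in range(n_records):
        for b in blocks_of(r, record_words, block_words, offset):
            owners.setdefault(b, set()).add(r)
    return sum(1 for records in owners.values() if len(records) > 1)


def expand_record(record_words, block_words):
    """Pad a record with dummy words up to a whole number of blocks (assumed policy)."""
    return -(-record_words // block_words) * block_words


if __name__ == "__main__":
    # 6-word records in 8-word blocks: GCD is 2.
    for offset in range(4):
        print(offset, average_span(6, 8, offset, n_records=4))
    print(shared_blocks(6, 8, n_records=4))                 # before padding
    print(shared_blocks(expand_record(6, 8), 8, n_records=4))  # after padding
```

For 6-word records and 8-word blocks, offsets that are multiples of the GCD (0, 2, 4, 6) give an average span of 1.5 blocks per record, while odd offsets give 1.75. Padding each record to 8 words removes all blocks shared between records, at the cost of two dummy words per record.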

'''2. Dynamic Cache sub-block Design to Reduce False Sharing [5]:''' This method minimizes false sharing by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block size is determined dynamically, with the objective of maximizing the size of a truly shared sub-block. It settles to the value most suitable for the application being run, which also minimizes the invalidation misses caused by false partitioning of truly shared blocks. Larger blocks exploit locality, while maintaining coherence on sub-blocks minimizes the bus traffic due to shared misses. Figure 3 shows the cache organization that supports the dynamic cache sub-block design.

[[Image:False03.JPG]]

Figure 3: New Architecture for the Dynamic Sub-block Coherence Scheme
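
The idea of keeping coherence state per sub-block while transferring whole blocks can be sketched in a few lines. The Python class below is a minimal illustration, not the protocol of [5]: the sub-block size is a fixed parameter (the dynamic adjustment of the size is not modeled), a miss is assumed to refresh every sub-block of the block, and the class and method names are hypothetical.

```python
class SubBlockCache:
    """One processor's cache with a valid bit per sub-block.

    Whole blocks are the unit of transfer; invalidations caused by
    remote writes only clear the valid bit of the affected sub-block.
    """

    def __init__(self, block_words, sub_words):
        if block_words % sub_words != 0:
            raise ValueError("block size must be a multiple of sub-block size")
        self.block_words = block_words
        self.sub_words = sub_words
        self.valid = {}    # block index -> list of per-sub-block valid bits
        self.misses = 0

    def _locate(self, addr):
        block = addr // self.block_words
        sub = (addr % self.block_words) // self.sub_words
        return block, sub

    def access(self, addr):
        """Local read or write; returns True on a hit, False on a miss."""
        block, sub = self._locate(addr)
        bits = self.valid.get(block)
        if bits is None or not bits[sub]:
            self.misses += 1
            # Assumed: the whole block is transferred and becomes valid.
            self.valid[block] = [True] * (self.block_words // self.sub_words)
            return False
        return True

    def remote_write(self, addr):
        """Another processor wrote addr: invalidate only that sub-block."""
        block, sub = self._locate(addr)
        if block in self.valid:
            self.valid[block][sub] = False


if __name__ == "__main__":
    whole = SubBlockCache(block_words=8, sub_words=8)   # one valid bit per block
    split = SubBlockCache(block_words=8, sub_words=4)   # two sub-blocks per block
    for cache in (whole, split):
        cache.access(0)        # cold miss, block loaded
        cache.remote_write(4)  # another processor writes a different word
        cache.access(0)        # false sharing miss only without sub-blocks
    print(whole.misses, split.misses)
```

With a single valid bit per block, the remote write to word 4 forces a second miss on word 0. With 4-word sub-blocks, word 0 stays valid and only a later access to word 4 would miss.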

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block (FSB) scheme (Figure 4).

[[Image:False04.JPG]]

Figure 4: The percentage reduction in false sharing miss rate of the DSB scheme relative to the FSB scheme for MP3D, Floyd, Water and Jacobi, as a function of block size.

== Techniques to reduce true sharing misses ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors, because high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.

In fact, true sharing misses are required for the correct execution of the program. Techniques that target true sharing therefore mainly aim to reduce the latency and bus traffic caused by each miss. One proposed technique is to give each processor a private L1 cache and to share an L2 cache among all the processors (also called an SPS2 system) [7]. Figure 5 shows one possible implementation of a private L1 cache and shared L2 cache architecture.
[[Image:True1.JPG|thumb|Figure 5: A possible SPS2 cache architecture to reduce true sharing latency.]]

The shared L2 is a multi-banked, multi-port cache that all the processors can access directly over the bus. Data in the private L1 and private L2 are exclusive, but the private L1 and the shared L2 can be inclusive. Unlike a unified L2 cache structure, the SPS2 system, with its split private and shared L2 caches, can be flexibly and individually designed according to demand. SPS2 reduces access latency and contention between shared data and private data. It achieves a low L2 hit latency because most private data is found in the local L2, while shared data is placed in the shared L2, which provides high storage capacity and helps reduce off-chip accesses.

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Languages and Operating Systems, Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors,” in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy, “Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates,” in Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T01:59:07Z

Dtiwari2: /* Techniques to reduce false sharing misses */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T01:56:11Z

Dtiwari2: /* <tt>'''Effect of false sharing'''</tt> */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T01:48:25Z

Dtiwari2: /* <tt>'''Effect of false sharing'''</tt> */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T01:42:27Z

Dtiwari2: /* <tt>'''Effect of false sharing'''</tt> */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T01:39:09Z

Dtiwari2: /* <tt>'''Effect of false sharing'''</tt> */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-25T01:36:17Z

Dtiwari2: /* <tt>'''Effect of false sharing'''</tt> */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-24T20:47:38Z

Dtiwari2: /* <tt>'''Effect of false sharing'''</tt> */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-24T20:47:01Z

Dtiwari2: /* <tt>'''Effect of false sharing'''</tt> */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-24T20:42:29Z

Dtiwari2: /* <tt>'''Effect of false sharing'''</tt> */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-24T20:41:45Z

Dtiwari2: /* <tt>'''Effect of false sharing'''</tt> */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-24T20:38:49Z

Dtiwari2: /* <tt>'''Effect of false sharing'''</tt> */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-24T20:16:30Z

Dtiwari2: /* Effect of false sharing */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-24T20:16:03Z

Dtiwari2: /* '''Introduction''' */

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-24T20:15:21Z

Dtiwari2: /* Introduction */

== '''Introduction''' ==

In a multiprocessor computing system with a snoopy coherence protocol, overall performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses; see [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication, however, are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and ''writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because a particular word in the block, other than the one being read, has been written.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== Effect of false sharing ==
If the modified word in a block is actually used by the processor that received the invalidation, then the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different, and the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. Evidently, increasing the cache block size also increases false sharing misses. Although large blocks exploit locality and decrease the effective memory access time, they also tend to group data together even though only a part of it is needed by any one processor. In general, past studies have shown that, as the cache block size increases, the data cache miss rate changes less predictably in multiprocessors than in uniprocessors [2].
The following figures depict the impact of false sharing on different parallel applications using 32 processors:
[[Image:False01.JPG]] [[Image:False02.JPG]]

Figure 1: The number of false sharing misses and true sharing misses that happen during executions of each parallel application (a) Barnes and (b) Cholesky [3].

Not only do false misses increase the latency of memory accesses, they also generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic. The traffic increase with larger blocks occurs because many of the words transferred are not used. Between two consecutive misses on a given block, a processor usually references a very small number of distinct words in that block.

== Techniques to reduce false sharing misses ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the more widely accepted ones:

'''1. Data placement optimizations:''' This method addresses the problem of reducing false sharing misses on shared data by enhancing the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.
Four optimizations of the data layout have been suggested in [4] to reduce false sharing cache misses in a multiprocessor environment:

[a] ''SplitScalar'': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

[b] ''HeapAllocate'': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

[c] ''Expand Record'': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, do not apply the optimization.

[d] ''Align Record'': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. It is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than 1. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.

'''2. Dynamic Cache sub-block Design to Reduce False Sharing [5]:''' This method minimizes false sharing by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block size is determined dynamically, with the objective of maximizing the size of a truly shared sub-block. It settles to the value most suitable for the application being run, which also minimizes the invalidation misses caused by false partitioning of truly shared blocks. Larger blocks exploit locality, while maintaining coherence on sub-blocks minimizes the bus traffic due to shared misses. Figure 2 shows the cache organization that supports the dynamic cache sub-block design.

[[Image:False03.JPG]]

Figure 2: New Architecture for the Dynamic Sub-block Coherence Scheme

The simulation results indicate that the dynamic sub-block protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block scheme (Figure 3).

[[Image:False04.JPG]]

Figure 3: The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.

== Techniques to reduce true sharing misses ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors, because high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.
In fact, true sharing misses are required for the correct execution of the program. Techniques that target true sharing therefore mainly aim to reduce the latency and bus traffic caused by each miss. One proposed technique is to give each processor a private L1 cache and to share an L2 cache among all the processors [7].

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Languages and Operating Systems, Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors,” in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy, “Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates,” in Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-24T20:14:55Z

Dtiwari2: /* Introduction */

== Introduction ==

In a multiprocessor computing system with a snoopy coherence protocol, overall performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses; see [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication, however, are called coherence misses and can be divided into two types:

* '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block and ''writes a particular word'' in that block. Subsequently, when another processor attempts to ''read the same modified word in the same cache block'', a miss occurs and the corresponding block is transferred.

* '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because a particular word in the block, other than the one being read, has been written.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, ''for example commercial workloads''.

== Effect of false sharing ==
If the modified word in a block is actually used by the processor that received the invalidation, then the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different, and the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. Evidently, increasing the cache block size also increases false sharing misses. Although large blocks exploit locality and decrease the effective memory access time, they also tend to group data together even though only a part of it is needed by any one processor. In general, past studies have shown that, as the cache block size increases, the data cache miss rate changes less predictably in multiprocessors than in uniprocessors [2].
The following figures depict the impact of false sharing on different parallel applications using 32 processors:
[[Image:False01.JPG]] [[Image:False02.JPG]]

Figure 1: The number of false sharing misses and true sharing misses that happen during executions of each parallel application (a) Barnes and (b) Cholesky [3].

Not only do false misses increase the latency of memory accesses, they also generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic. The traffic increase with larger blocks occurs because many of the words transferred are not used. Between two consecutive misses on a given block, a processor usually references a very small number of distinct words in that block.

== Techniques to reduce false sharing misses ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the more widely accepted ones:

'''1. Data placement optimizations:''' This method addresses the problem of reducing false sharing misses on shared data by enhancing the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.
Four optimizations of the data layout have been suggested in [4] to reduce false sharing cache misses in a multiprocessor environment:

[a] ''SplitScalar'': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

[b] ''HeapAllocate'': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

[c] ''Expand Record'': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, do not apply the optimization.

[d] ''Align Record'': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. It is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than 1. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.

'''2. Dynamic Cache sub-block Design to Reduce False Sharing [5]:''' This method minimizes false sharing by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block size is determined dynamically, with the objective of maximizing the size of a truly shared sub-block. It settles to the value most suitable for the application being run, which also minimizes the invalidation misses caused by false partitioning of truly shared blocks. Larger blocks exploit locality, while maintaining coherence on sub-blocks minimizes the bus traffic due to shared misses. Figure 2 shows the cache organization that supports the dynamic cache sub-block design.

[[Image:False03.JPG]]

Figure 2: New Architecture for the Dynamic Sub-block Coherence Scheme

The simulation results indicate that the dynamic sub-block protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block scheme (Figure 3).

[[Image:False04.JPG]]

Figure 3: The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.

== Techniques to reduce true sharing misses ==

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors, because high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.
In fact, true sharing misses are required for the correct execution of the program. Techniques that target true sharing therefore mainly aim to reduce the latency and bus traffic caused by each miss. One proposed technique is to give each processor a private L1 cache and to share an L2 cache among all the processors [7].

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Languages and Operating Systems, Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors,” in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy, “Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates,” in Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-24T18:54:05Z

Dtiwari2: /* Introduction */

== Introduction ==

In a multiprocessor computing system with a snoopy coherence protocol, overall performance is determined by two factors:
* Uniprocessor cache miss traffic
* The traffic caused by communication, which generates invalidations and subsequent cache misses.
The uniprocessor misses are categorized into the ''three C’s'' (capacity, compulsory, and conflict misses; see [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]). The misses arising from interprocessor communication, however, are called coherence misses and can be divided into two types:

1. '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block. Subsequently, when another processor attempts to read a modified word in that cache block, a miss occurs and the block is transferred.

2. '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because some word in the block, other than the one being read, has been written.

The effect of coherence misses becomes significant for tightly coupled applications that share a significant amount of data, for example commercial workloads.

== Effect of false sharing ==
If the modified word in a block is actually used by the processor that received the invalidation, then the reference is a true sharing reference and would cause a miss independent of the block size. If, however, the word being written and the word being read are different, and the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. Evidently, increasing the cache block size also increases false sharing misses. Although large blocks exploit locality and decrease the effective memory access time, they also tend to group data together even though only a part of it is needed by any one processor. In general, past studies have shown that, as the cache block size increases, the data cache miss rate changes less predictably in multiprocessors than in uniprocessors [2].
The following figures depict the impact of false sharing on different parallel applications using 32 processors:
[[Image:False01.JPG]] [[Image:False02.JPG]]

Figure 1: The number of false sharing misses and true sharing misses that happen during executions of each parallel application (a) Barnes and (b) Cholesky [3].

Not only do false misses increase the latency of memory accesses, they also generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic. The traffic increase with larger blocks occurs because many of the words transferred are not used. Between two consecutive misses on a given block, a processor usually references a very small number of distinct words in that block.

== Techniques to reduce false sharing misses ==

Various hardware and software techniques have been proposed to reduce false sharing misses. The following are some of the more widely accepted ones:

'''1. Data placement optimizations:''' This method addresses the problem of reducing false sharing misses on shared data by enhancing the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.
Four optimizations of the data layout have been suggested in [4] to reduce false sharing cache misses in a multiprocessor environment:

[a] ''SplitScalar'': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

[b] ''HeapAllocate'': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

[c] ''Expand Record'': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, do not apply the optimization.

[d] ''Align Record'': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. It is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than 1. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.

'''2. Dynamic Cache sub-block Design to Reduce False Sharing [5]:''' This method minimizes false sharing by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) that are truly shared, whereas larger blocks are used as the basic units of transfer.
The sub-block size is determined dynamically, with the objective of maximizing the size of a truly shared sub-block. It settles to the value most suitable for the application being run, which also minimizes the invalidation misses caused by false partitioning of truly shared blocks. Larger blocks exploit locality, while maintaining coherence on sub-blocks minimizes the bus traffic due to shared misses. Figure 2 shows the cache organization that supports the dynamic cache sub-block design.

[[Image:False03.JPG]]

Figure 2: New Architecture for the Dynamic Sub-block Coherence Scheme

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces false sharing misses by 20 to 90 percent compared with the fixed sub-block (FSB) scheme (Figure 3).

[[Image:False04.JPG]]

Figure 3: The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.

== Techniques to reduce true sharing misses ==

True sharing requires the processors involved to synchronize explicitly with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors, because high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.
In fact, true sharing misses are required for the correct execution of the program, so they cannot simply be eliminated. Techniques for true sharing are therefore mainly concerned with reducing the latency and bus traffic caused by each miss. One proposed technique is to give each processor a private L1 cache and to share the L2 cache among all the processors [7].

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-24T18:42:48Z

Dtiwari2: /* Introduction */

== Introduction ==

In a multiprocessor using a snoopy coherence protocol, overall performance depends on both the uniprocessor cache miss traffic and the traffic caused by communication, which results in invalidations and subsequent cache misses. Uniprocessor misses are classified into the three C’s (capacity, compulsory, and conflict) [1]. Similarly, the misses that arise from interprocessor communication are called coherence misses and come from two different sources:

1. '''True sharing misses''': These misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block. Subsequently, when another processor attempts to read a modified word in that cache block, a miss occurs and the resultant block is transferred.

2. '''False sharing misses''': These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because some word in the block, other than the one being read, is written into.

The effect of coherence misses becomes significant for tightly coupled applications that share significant amounts of data, for example commercial workloads.

== Effect of false sharing ==
If the word (in a block) that was modified is actually used by the processor that received the invalidation, then the reference is a true sharing reference and would have caused a miss regardless of the block size. If, however, the word being written and the word being read are different, and the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss. Evidently, if the cache block size is increased, false sharing misses also increase. Although large blocks exploit locality and decrease the effective memory access time, they also tend to group data together even though only part of it is needed by any one processor. In general, past studies have shown that the miss rate of the data cache in multiprocessors changes less predictably than in uniprocessors as the cache block size increases [2].
The following figures depict the impact of false sharing for different parallel applications using 32 processors:
[[Image:False01.JPG]] [[Image:False02.JPG]]

Figure 1: The number of false sharing misses and true sharing misses that happen during executions of each parallel application (a) Barnes and (b) Cholesky [3].

Not only do false sharing misses increase the latency of memory accesses, they also generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic. The traffic increases with larger blocks because many of the words transferred are not used: between two consecutive misses on a given block, a processor usually references a very small number of distinct words in that block.
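
The distinction between true and false sharing misses, and its dependence on block size, can be illustrated with a small trace-driven sketch. The Python code below is not taken from the cited studies. It assumes infinite caches, a write-invalidate protocol, and a simplified classification in which a miss on an invalidated block counts as true sharing only if the missing word was written by another processor since the invalidation. The function name and the trace format are hypothetical.

```python
def classify_misses(trace, block_words):
    """Count cold, true sharing and false sharing misses for a trace.

    trace: list of (processor, op, word_address) with op 'R' or 'W'.
    block_words: number of words per cache block.
    """
    processors = {p for p, _, _ in trace}
    valid = set()    # (processor, block) pairs that hold a valid copy
    stale = {}       # (processor, block) -> words written by others since invalidation
    counts = {"cold": 0, "true": 0, "false": 0}
    for proc, op, addr in trace:
        block = addr // block_words
        key = (proc, block)
        if key not in valid:                      # miss
            if key not in stale:
                counts["cold"] += 1               # block never cached by proc
            elif addr in stale[key]:
                counts["true"] += 1               # a newly written value is needed
            else:
                counts["false"] += 1              # only the block was shared
            valid.add(key)
            stale.pop(key, None)
        if op == "W":                             # invalidate all other copies
            for other in processors - {proc}:
                okey = (other, block)
                if okey in valid:
                    valid.remove(okey)
                    stale[okey] = set()
                if okey in stale:
                    stale[okey].add(addr)
    return counts

# Two processors repeatedly writing the adjacent words 0 and 1.
trace = [(0, "W", 0), (1, "W", 1)] * 3
print(classify_misses(trace, block_words=2))   # both words in one block
print(classify_misses(trace, block_words=1))   # one word per block
```

With two words per block, every access after the two cold misses is a false sharing miss. With one word per block the same trace produces only the two cold misses, which is the block-size effect described above.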

== Techniques to reduce false sharing misses ==

Various techniques have been proposed at both the hardware and software levels to reduce false sharing misses. The following are some of the popular and accepted techniques:

'''1. Data placement optimizations:''' This method addresses the problem of reducing false sharing misses on shared data by enhancing the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects.
Four optimizations of the data layout have been suggested in [4] to reduce false sharing cache misses in a multiprocessor environment:

[a] ''SplitScalar'': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

[b] ''HeapAllocate'': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

[c] ''Expand Record'': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, do not apply the optimization.

[d] ''Align Record'': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. It is possible when the number of words in the record and the number of words in the cache block have a greatest common divisor (GCD) larger than 1. The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.

'''2. Dynamic Cache sub-block Design to Reduce False Sharing [5]:''' This method reduces false sharing by dynamically locating the point of false reference. Coherence is maintained on smaller units (sub-blocks) that are truly shared, which reduces sharing traffic, while larger blocks remain the basic unit of transfer.
The sub-block size is determined dynamically. The goal is to make each truly shared sub-block as large as possible, so that the size settles at the value best suited to the application being run, while also minimizing the invalidation misses caused by falsely partitioning truly shared blocks. Larger blocks exploit locality, and maintaining coherence on sub-blocks reduces the bus traffic due to sharing misses. Figure 2 shows the cache organization that supports the dynamic sub-block design.

[[Image:False03.JPG]]

Figure 2: New Architecture for the Dynamic Sub-block Coherence Scheme

The simulation results indicate that the dynamic sub-block (DSB) protocol reduces false sharing misses by 20 to 90 percent compared with the fixed sub-block (FSB) scheme (Figure 3).

[[Image:False04.JPG]]

Figure 3: The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.

== Techniques to reduce true sharing misses ==

True sharing requires the processors involved to synchronize explicitly with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors, because high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.
In fact, true sharing misses are required for the correct execution of the program, so they cannot simply be eliminated. Techniques for true sharing are therefore mainly concerned with reducing the latency and bus traffic caused by each miss. One proposed technique is to give each processor a private L1 cache and to share the L2 cache among all the processors [7].

== References ==

[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition.

[2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf. Architectural Support for Programming Lang. and Operating Syst., Apr. 1989, pp. 257-270.

[3] J. Lee and Y. Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processing, volume II (Software), pages 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP

CSC/ECE 506 Fall 2007/wiki3 1 ncdt

2007-10-24T18:38:18Z

Dtiwari2:


CSC/ECE 506 Fall 2007/wiki1 12 dp3

2007-09-11T03:59:12Z

Dtiwari2: /* Introduction */

Sections 1.3.3 and 1.3.4: Most changes here are probably related to performance metrics. Cite other models for measuring artifacts such as data-transfer time, overhead, occupancy, and communication cost. Focus on the models that are most useful in practice.

== Communication and Replication ==

In this section, we describe two terms, communication and [http://en.wikipedia.org/wiki/Replication_(computer_science) replication], and draw the distinction between them.

Communication between two [http://en.wikipedia.org/wiki/Process_(computing) processes] is said to occur when data written by one process is read by a second process, which causes a data transfer between the processes. If, however, the data is merely stored at one process (because it was initially placed there or was too large to fit anywhere else) and the transfer only makes another copy of it at the second process, then it is called replication.
For example, when a processor requests data and we copy it from main memory into the cache, the operation is replication of data. In contrast, if data is produced by a sender process and transferred to a receiver process by message passing, it is an example of communication.

Communication and replication both involve data transfer, which can be defined as the transfer of data across different memory locations. For interprocess(or) communication, the data is transferred across the memory local to the communicating processor or from a remote storage device. When a miss occurs in the cache, the data is transferred from memory to the cache. When cache contents that result from replication are updated or changed, these changes must be propagated to all the other hidden replicas. This is another aspect of data transfer.

== Performance ==

=== Introduction ===

In this section, we briefly discuss various important aspects of performance measurement in parallel computer architecture, along with the basic performance metrics.

Performance measurement is one of the fundamental issues in [http://en.wikipedia.org/wiki/Uniprocessor uniprocessor systems], where architects focus on improving performance by reducing the execution time of standard programs called benchmarks. They use several techniques, such as minimizing memory access time and designing hardware that can execute many instructions in parallel and possibly faster ([http://en.wikipedia.org/wiki/Instruction_level_parallelism micro level parallelism] extraction). Performance measurement is an even more serious concern in parallel computing because, apart from computing performance, we also need to analyze communication cost: data is shared among many processors, and processes (possibly on different processors) need to communicate efficiently, coherently and correctly.

To make our point more precise, let us consider the following example:

:Assume we want to run a program that takes 100 seconds on a uniprocessor. We also know that the program can be decomposed into many processes and that these processes can run on different machines. In the best case, we expect a speedup of <tt>n</tt>, where <tt>n</tt> is the number of processors available. We have divided the computing load, but these processes cannot run to completion independently. To run the program correctly, they need to communicate with each other for data sharing, [http://en.wikipedia.org/wiki/Synchronization synchronization] etc. Hence, there is communication overhead involved in parallel computing.

The [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:P02.jpg following figure] ([http://www.cs.berkeley.edu/~culler/cs258-s99/ reference]) helps in understanding how parallel processors typically spend their execution time on different activities: local/remote data access, computing, synchronization and other work.

[[Image: P02.jpg| P02.jpg]]

The sequential program takes 100 seconds to run. In this hypothetical example, it is assumed that 80% of the time is busy-useful time (i.e. execution of instructions) and that the processor spends the remaining 20% accessing local data.
Four parallel processors solve the same problem in 55 seconds (a speedup of 1.8 instead of the expected speedup of 4). The parallel processors spend time accessing data at both local and remote locations. They execute instructions, which we call busy-useful time, and they also synchronize with each other to execute the program correctly. In such a parallel computing environment, however, processors also perform some work that would not be needed if the program were run sequentially; the time spent in such activities is called busy-overhead time. This kind of work is also called 'extra work', which we will discuss shortly.

Clearly, a wise architect would not want a parallel system in which communication overheads overwhelm the speedup achieved by dividing the computing load. In other words, dividing the computing load among different processors is a good idea only when communication costs do not rise too much. Similarly, one should not destroy the inherent parallelism available in the program in order to reduce communication cost.

The [http://en.wikipedia.org/wiki/Speedup speedup] ([http://www.cs.berkeley.edu/~culler/cs258-s99/ reference]) gained by parallel computing has to take into account the synchronization time, communication cost and extra work in addition to the computing work. Speedup can therefore be expressed as follows:

[[Image: P01.jpg| P01.jpg]]

In the equation above, 'extra work' is work done by processors other than computation, synchronization and communication. This might include:
:Computing a good partition for a particular problem.
:Using redundant computation to avoid communication.
:Task, data and process management overhead etc.

Hence, to increase the speedup (improve performance), architects must focus on all the factors appearing in the denominator of the above equation. Therefore, we need to consider various design trade-offs while analyzing the performance of a parallel computing architecture.
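
As a rough numerical illustration, the sketch below assumes the usual bound in which the parallel time is set by the slowest processor's sum of computation, synchronization wait, communication cost and extra work. The exact form of the equation in the figure is not reproduced here, and the per-processor breakdown is hypothetical; only the 100-second and 55-second totals come from the example above.

```python
def speedup_bound(sequential_time, per_processor_times):
    """Sequential time divided by the slowest processor's total time.

    per_processor_times: list of dicts giving the seconds each processor
    spends on work, synchronization, communication and extra work.
    """
    slowest = max(sum(t.values()) for t in per_processor_times)
    return sequential_time / slowest

# Hypothetical breakdown for four processors, each totalling 55 seconds.
procs = [
    {"work": 25, "sync": 10, "comm": 15, "extra": 5},
    {"work": 25, "sync": 12, "comm": 13, "extra": 5},
    {"work": 25, "sync": 8, "comm": 17, "extra": 5},
    {"work": 25, "sync": 10, "comm": 15, "extra": 5},
]
print(round(speedup_bound(100, procs), 2))
```

Reducing any of the terms on the slowest processor raises the bound, which is why all of them matter to the architect.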

There are three basic performance metrics.
* [http://en.wikipedia.org/wiki/Latency_(engineering) Latency]: The time taken by an operation to complete (measured in seconds per operation).
* [http://en.wikipedia.org/wiki/Bandwidth Bandwidth]: The rate at which operations are executed (measured in operations per second).
* Cost: The impact of operations on the total execution time of the program (measured as latency times number of operations).

In a uniprocessor system, bandwidth is simply the reciprocal of latency. In parallel computing, however, many operations are performed concurrently, so the relationship among the performance metrics is not as simple. We also need to consider the performance of communication operations along with computing operations.

We list three [http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Artifacts_of_Measuring_Performance artifacts of measuring performance]; since data transfer operations are the most frequent type of communication operation, they are discussed first.

=== Artifacts of Measuring Performance ===

*[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Data_Transfer Data Transfer]
*[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Overhead_and_Occupancy Overhead and Occupancy]
*[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Communication_Cost Communication Cost]

==== Data Transfer ====

For any data transfer, we would like to estimate the time it consumes so that we can improve the overall performance of the system by reducing the data transfer time. To estimate it, a simple ''[http://en.wikipedia.org/wiki/Linear_model linear model]'' is used ([http://www.cs.berkeley.edu/~culler/cs258-s99/#lectures referenced lecture material]):

<tt><center> '''Total Transfer Time (T) = Start-up Time (T0) + Data Transfer Time (TD)'''</center> </tt>

Total transfer time has two components:
1. A constant term (T0), called the start-up cost. We will return to this shortly in more detail.
2. The data transfer time, which is estimated as follows:

<tt><center>'''Data Transfer Time (TD) = Size of Data (n) / Bandwidth (B)'''</center></tt>

Bandwidth (B) is also called data transfer rate.

To have better understanding of the model, we should be clear about the following points:

* If we have only one pair of hosts, the data transfer rate is simply the bandwidth of the link connecting them.
* However, if there are many hosts between the source host and the destination host, the bottleneck is the link with the lowest bandwidth.

An important point to note is that the '''achievable bandwidth''' depends on the transfer size, which is why the bandwidth (<tt>'''B'''</tt>) is sometimes called the '''peak bandwidth'''. For example:

::Suppose we have two hosts connected by a link with a bandwidth of 20 MB/s, and the start-up cost of communication is 2 microseconds. If we want to transfer an image of size 40 MB, the total transfer time is 2 seconds plus 2 microseconds. Given the available peak bandwidth of 20 MB/s, one might have expected to complete the transfer in 2 seconds, achieving the peak bandwidth, but the start-up cost prevents this. Clearly, as the amount of data increases, the achievable bandwidth approaches the asymptotic rate (B); the start-up cost determines how quickly the asymptotic rate is approached.

As a special case, the amount of data required to achieve half of the peak bandwidth (<tt>'''B'''</tt>) is equal to <tt>'''T0 X B'''</tt>. This is also called the '''half-power point'''. Please note that the printed version of the [http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 text-book] has an erroneous formula for calculating the half-power point.
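
The linear model and the half-power point can be checked numerically. The half-power size follows from setting n / (T0 + n/B) equal to B/2, which gives n = T0 X B. The Python sketch below uses the numbers from the example above; the function names are chosen for illustration, and 1 MB is taken to be 10^6 bytes, which is an assumption.

```python
def transfer_time(n_bytes, startup_s, peak_bw):
    """Linear model: T = T0 + n / B (seconds)."""
    return startup_s + n_bytes / peak_bw

def achieved_bandwidth(n_bytes, startup_s, peak_bw):
    """Effective bandwidth n / T for a transfer of n bytes."""
    return n_bytes / transfer_time(n_bytes, startup_s, peak_bw)

def half_power_point(startup_s, peak_bw):
    """Transfer size at which achieved bandwidth is half of peak: n = T0 * B."""
    return startup_s * peak_bw

T0 = 2e-6    # start-up cost: 2 microseconds
B = 20e6     # peak bandwidth: 20 MB/s, with 1 MB = 10**6 bytes (assumption)

print(transfer_time(40e6, T0, B))                  # 40 MB image
n_half = half_power_point(T0, B)
print(n_half, achieved_bandwidth(n_half, T0, B) / B)
```

For these values the half-power point is only about 40 bytes, so the 40 MB transfer runs at essentially the peak rate, while a 40-byte message achieves about half of it.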

Now we discuss the first part of the total transfer time, i.e. the start-up cost. Notice that <tt>'''T0'''</tt> is a constant term for a particular data transfer, but it may vary as we consider data transfer over different entities. ''For example'', in a memory operation the start-up cost is the memory access time. In message passing, the start-up cost can be estimated as the time taken by the first bit to reach the destination. For pipelined operations, the start-up cost is simply the time taken to fill the pipeline. For bus transactions, it is the time spent in the arbitration and command phases.

As parallel computing has advanced, one of the major focuses has been to ''reduce the start-up cost''. There are many ways to do so; we describe a few of them here. As stated earlier, the start-up cost for memory operations is basically the memory access time. To reduce memory access time, architects have introduced a costly (hence small) but fast storage area called a [http://en.wikipedia.org/wiki/Cache cache]. By exploiting [http://en.wikipedia.org/wiki/Memory_locality spatial and temporal locality], the cache is filled with useful items, so the processor does not have to go to memory (long [http://en.wikipedia.org/wiki/Memory_latency latency]) every time it needs data.
The average access time is governed by the following formula ([http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]):

<tt><center> '''Average memory access time = Hit time for the cache + (Miss Rate X Miss Penalty)'''</center></tt>

Therefore architects often try to reduce all three components by adopting [http://citeseer.ist.psu.edu/kowarschik03overview.html different cache optimizations] such as multilevel caches, larger block sizes, higher associativity, etc.

We quote the access times of cache and main memory to see how beneficial a cache can be if we manage to get considerably high hit rates.
Cache access time is typically 0.5-25 ns, while for main memory it is 50-250 ns, so such a memory hierarchy can decrease the start-up cost considerably. Bandwidth for caches is around 5,000-20,000 MB/sec, but for memory it is as low as 2,500-10,000 MB/sec. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]
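
The average memory access time formula can be tried out with a few numbers. The following is a minimal Python sketch; the function name and the sample values (1 ns hit time, 100 ns miss penalty, miss rates of 2% and 10%) are illustrative assumptions chosen from within the ranges quoted above, not measurements of any particular machine.

```python
def average_memory_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time = hit time + miss rate x miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time_ns + miss_rate * miss_penalty_ns


if __name__ == "__main__":
    # Assumed values: 1 ns cache hit time, 100 ns main-memory miss penalty.
    for miss_rate in (0.02, 0.10):
        amat = average_memory_access_time(1.0, miss_rate, 100.0)
        print(f"miss rate {miss_rate:.0%}: average access time {amat:.1f} ns")
```

Even with a 10% miss rate, the average access time under these assumed numbers stays far below the 100 ns cost of going to main memory on every access, which is why a high hit rate reduces the start-up cost so effectively.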

Similarly, for ''bus transactions'' the start-up cost is the time spent during the arbitration and command phases. Suppose that on a 3.6 GHz Intel Xeon machine (year 2004) it takes 3 cycles to arbitrate the bus and present the address; the start-up cost is then around 0.83 nanoseconds. Assuming that around 1994-95 it took the same 3 cycles on a 0.3 GHz Alpha 21064A processor (10 nanoseconds), we can see that the start-up cost has been reduced by more than 10 times. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Chapter 1]
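
The arithmetic behind this comparison is simply the cycle count divided by the clock frequency. A small Python sketch follows; the function name is hypothetical, and the 3-cycle figure is the same assumption made in the text for both processors.

```python
def bus_startup_time_ns(cycles, clock_ghz):
    """Start-up time in nanoseconds for a given cycle count and clock (GHz)."""
    return cycles / clock_ghz


if __name__ == "__main__":
    xeon = bus_startup_time_ns(3, 3.6)   # about 0.83 ns
    alpha = bus_startup_time_ns(3, 0.3)  # about 10 ns
    print(f"Xeon 3.6 GHz:  {xeon:.2f} ns")
    print(f"Alpha 0.3 GHz: {alpha:.2f} ns")
    print(f"Reduction factor: {alpha / xeon:.1f}x")
```

Because the cycle count is held fixed, the reduction factor equals the ratio of the two clock frequencies.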

''Pipelining'' is another way to reduce the data transfer time. For pipelined systems, the time to fill the pipeline is the total start-up cost. Although introducing a pipeline seems to add extra start-up cost, more importantly it allows multiple operations to take place concurrently, which helps in achieving higher bandwidth. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix A]

Calculating the start-up cost is often challenging. As technology advances, the focus is to reduce the start-up cost and increase bandwidth.

[http://pg.ece.ncsu.edu/mediawiki/images/3/37/P03.jpg Next plot] (on log-log scale) shows the time for a message passing operation on several machines as a function of message size [http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 source]. The start-up costs for different machines vary a lot (spread over nearly an order of magnitude), and for small amounts of data the total transfer time is a non-linear function of message size, contrary to the linear data transfer model. For large data transfers we get an almost linear relationship, from which the bandwidth can be estimated.

[[Image:P03.jpg | Time for message passing operation versus message size ]]

Quoting other figures, the '''iPSC/2''' machine has a start-up cost of 700 microseconds, while the '''CRAY T3D (PVM)''' machine has a start-up cost of 21 microseconds: within one decade the start-up cost has dropped by more than an order of magnitude. The '''NOW''' machine has a start-up cost of just 16 microseconds. These improvements are essentially due to improvements in cycle time.

Similar trends can be observed in the maximum bandwidth achievable on each machine for data transfer. For the '''nCUBE/2''' machine the maximum bandwidth was 2 MB/s, but relatively advanced machines like the '''SP2''' have a bandwidth of 40 MB/s.

: However, this data transfer model has a few shortcomings too.
This model does not indicate when the next operation can be performed. Estimating the time interval between two operations is particularly useful because bandwidth depends on how frequently operations can be initiated.
This model also does not tell whether other useful work can be done during the transfer.
This model is easy to understand but not very suitable for architectural evaluation. For network transactions, the total message time is difficult to measure unless there is a global clock, as the send and receive usually happen on different processors. So, transaction time is usually measured with an echo test (i.e. one processor sends the data and waits until it receives a response). But this is reliable only if the receive is posted before the message arrives; hence measuring transaction time is very challenging (and not always accurate) in this transfer model.

::In this section we discussed different aspects of the data transfer model. In a parallel computing environment, data transfers usually take place across the network and are invoked by the processor through the communication assist. Therefore, we now need to look at how communication costs are estimated and which factors are important to consider.

==== Overhead and Occupancy ====

Processor execution time has three components: ''Computation Time'', ''Idle Time'', and ''Communication Time''. Communication time is the time the processor spends exchanging messages with another process(or). Communication can be interprocessor, where the two communicating tasks are handled by two different processors, or intraprocessor, where the communicating tasks are handled by the same processor. Generally, intraprocessor and interprocessor communication cost the same, unless the former is highly optimized.

Communication time is a function of the number of bytes (n) transferred. It can be given as below:

<tt><center> '''T(n) = Overhead + Occupancy + Network Delay + Message Size / Bandwidth + Delay due to Contention'''</center></tt>

[[Image:time.jpg]]

Source [http://www.cs.berkeley.edu/~culler/cs258-s99/ Lecture :Culler]

Communication overhead includes the time spent on:
*Creating messages
*Executing communication protocols
*Physically sending messages
*Running through the protocol sets and decoding the message on the receiving node.

During this period the processor cannot do any useful computational work. Parallel programs running on different processors need to coordinate their work among themselves. This increases the rate of interprocessor communication, which in turn increases the net overhead cost.

Occupancy is the time spent at the slowest component in the communication assist, and it affects performance in a couple of ways. It delays the current request and indirectly contributes to the delays of subsequent requests. The occupancy sets the upper limit on how frequently communication operations can be initiated by the processor.
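
The communication time expression and the limit imposed by occupancy can be sketched in Python as follows. The class and function names, and all numeric values in the usage example, are hypothetical placeholders chosen only to make the sketch runnable; they do not describe any real machine.

```python
from dataclasses import dataclass


@dataclass
class CommParams:
    overhead_s: float            # processor time spent creating/decoding the message
    occupancy_s: float           # time at the slowest stage of the communication assist
    network_delay_s: float       # delay through the network
    bandwidth_bytes_per_s: float
    contention_s: float = 0.0    # extra delay due to contention


def communication_time(n_bytes, p):
    """T(n) = overhead + occupancy + network delay + n / bandwidth + contention."""
    return (p.overhead_s + p.occupancy_s + p.network_delay_s
            + n_bytes / p.bandwidth_bytes_per_s + p.contention_s)


def max_initiation_rate(p):
    """Occupancy bounds how often operations can start: at most 1 / occupancy per second."""
    return 1.0 / p.occupancy_s


if __name__ == "__main__":
    # Assumed values, for illustration only.
    params = CommParams(overhead_s=1e-6, occupancy_s=0.5e-6,
                        network_delay_s=2e-6, bandwidth_bytes_per_s=100e6)
    print(communication_time(1000, params))
    print(max_initiation_rate(params))
```

For small messages the fixed terms dominate, while for large messages the n / bandwidth term dominates, which mirrors the behaviour of the linear data transfer model.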

''Some recent designs have helped reduce these communication costs. [http://en.wikipedia.org/wiki/Blue_Gene IBM Blue Gene (L)] uses a [http://www-03.ibm.com/industries/education/doc/content/bin/WhitepaperIBMBlueGenev7.0.pdf Global Collective Network], which carries out operations within the network itself. This saves the processors the time needed to decode messages with intermediate values, calculate new intermediate values, create new messages, and send them on to other nodes. The overhead is now primarily due to the communication protocol. It also has a dedicated communications network, the [http://www-03.ibm.com/industries/education/doc/content/bin/WhitepaperIBMBlueGenev7.0.pdf Global Barrier and Interrupt Network], to speed up task-to-task coordination activities. IBM BG/L also employs a [http://www-03.ibm.com/industries/education/doc/content/bin/WhitepaperIBMBlueGenev7.0.pdf Torus Network], in which the path length grows linearly while the number of nodes (processors) scales as a cube. The torus network also allows messages to be sent in either direction, as in a ring network, and hence halves the distance between the furthest points. This in turn reduces the network delays.''

[[Image:torus1.jpg]][[Image:global.jpg]][[Image:giga.jpg]]

(a) Three-dimensional Torus. (b) Global Collective Network. (c) IBM BG/L Control Network & Gigabit Ethernet networks

Source [http://www.research.ibm.com/journal/rd/492/gara.pdf IBM BG/L]

''IBM Blue Gene employs a simultaneous send/receive technique in the torus network. Hence, in an N-dimensional torus, a single node can send to or receive from its 2N neighbors simultaneously along its 2N different links (six links in the three-dimensional torus of BG/L).''

[[Image:torus.jpg]]

''If the cost of a single send is given by Ts (without simultaneous send) then for ‘S’ simultaneous sends the total cost becomes''
<center>Ts + Ts x (S – 1) x f where f: 0 < f < 1 </center>
''Compared with S sequential sends, the speedup is approximately 1/f when S is large. Below is the performance comparison of an algorithm (used for image processing in a microscopy application) running on Blue Gene/L versus an Intel Linux cluster, which does not employ simultaneous send.''

[[Image:bll.jpg]]
*source: [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1924514 BMC Cell Biology]

''For the Linux cluster, the computation time is a very high percentage (up to 40%) of the total time. On Blue Gene it is as little as 5% of the total runtime.''
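
To see why the speedup from simultaneous sends tends towards 1/f, the cost expression Ts + Ts x (S – 1) x f can be compared with the cost S x Ts of issuing the same S sends one after another. The Python sketch below does this; the function names and the value f = 0.2 are illustrative assumptions.

```python
def sequential_send_cost(ts, s):
    """Cost of S sends issued one after another."""
    return ts * s


def simultaneous_send_cost(ts, s, f):
    """Cost of S simultaneous sends: Ts + Ts * (S - 1) * f, with 0 < f < 1."""
    if not 0.0 < f < 1.0:
        raise ValueError("f must lie strictly between 0 and 1")
    return ts + ts * (s - 1) * f


def simultaneous_send_speedup(s, f):
    """Ratio of sequential to simultaneous cost; Ts cancels out."""
    return sequential_send_cost(1.0, s) / simultaneous_send_cost(1.0, s, f)


if __name__ == "__main__":
    f = 0.2  # assumed value, so the limiting speedup 1/f is 5
    for s in (2, 6, 100, 10000):
        print(s, round(simultaneous_send_speedup(s, f), 3))
```

The ratio is S / (1 + (S – 1) x f), which stays below 1/f for any finite S and approaches it as S grows.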

From the processor’s point of view, a number of other network delays can be categorized as occupancy. Contention for resources can be viewed as one such occupancy, and it reduces the net bandwidth: if P processors concurrently use a network of bandwidth B, the effective bandwidth available to each is roughly B/P. There are basically two types of contention. Contention at routers and switches is called network contention; contention observed at endpoints or processing nodes is called endpoint contention. When endpoint contention occurs, the processing nodes involved are called hot spots. This type of contention can often be alleviated in software.

''The global interrupt and barrier network and the global tree network operate in parallel and provide global asynchronous sideband signals. This results in low round-trip latency, as low as 1.3 microseconds. Network contention can always increase the latency. IBM BG/L uses the virtual cut-through (VCT) routing technique.''

==== Communication Cost ====

Ultimately, we want to reduce the communication cost, which is given by the following equation:

<tt><center>'''Communication cost = Frequency of communication x (Communication Time – Overlap)'''</center></tt>

The frequency of communication depends on the machine architecture and the program design. Some architectures, such as scale-up symmetric multiprocessing (SMP) and scale-out massively parallel processing (MPP) [http://www.microsoft.com/technet/solutionaccelerators/cits/interopmigration/unix/hpcunxwn/ch01hpc.mspx (Microsoft Solution: HPC)] systems, support tightly coupled parallel applications. This results in a high frequency of communication, which makes it important to keep the other parts of communication time, such as overhead and network delays, small. Loosely coupled parallel applications, on inherently parallel system architectures, require minimal interprocess communication.

The overlap is the portion of the communication operation that is performed concurrently while the processor is engaged in other useful work (computation or other communication). This concept is exploited to obtain high throughput. For instance, each node of IBM BG/L has an IBM CMOS ASIC with two independent cores (microprocessors). There is virtually no difference between the cores: each processor can handle its own communication, or one processor can be used for communication and the other for computation. In this way a very high degree of overlap is achieved.
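
The effect of overlap on the communication cost equation can be illustrated with a short Python sketch. The function name and the numbers (1000 messages of 10 microseconds each) are assumptions made purely for illustration.

```python
def communication_cost(frequency, comm_time_s, overlap_s):
    """Communication cost = frequency x (communication time - overlap)."""
    if overlap_s < 0 or overlap_s > comm_time_s:
        raise ValueError("overlap must lie between 0 and the communication time")
    return frequency * (comm_time_s - overlap_s)


if __name__ == "__main__":
    messages = 1000      # assumed number of communication operations
    comm_time = 10e-6    # assumed time per operation, in seconds
    for overlap in (0.0, 4e-6, 10e-6):
        cost = communication_cost(messages, comm_time, overlap)
        print(f"overlap {overlap * 1e6:.0f} us: cost {cost * 1e3:.1f} ms")
```

With full overlap the visible cost drops to zero even though the same number of messages is sent, which is the motivation for dedicating one core to communication.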

=== Scalability ===

Scalability is such an important performance metric for parallel computers that it deserves its own heading. A general perception is that performance increases as the number of processors is increased arbitrarily. This is not strictly true for parallel computers. Scalability means that there exists an isoefficiency function for a parallel system such that the efficiency remains the same as the problem size is increased along with the processor count. Scalability is bounded by two limits. Weak scaling: the load on each individual processor remains the same while the number of processors is increased. Strong scaling: the problem size remains the same, so the load on each individual processor is reduced as the processor count is increased. Generally, all problems lie between these two limits.

Is there a limit to the number of processors? Amdahl's law gives a picture of how performance is affected by increasing the number of processors. If the problem size is fixed and execution takes time T on a uniprocessor system, then on a parallel system with P processors it will take
<center>T x q + (1 – q) x T / P</center>

where q is the sequential fraction of the program, so that T x q is the time taken by its sequential part. The speedup is then
<center>S = 1 / (q + (1 – q)/P)</center>
Plotting this gives the following result:

[[Image:amdhals.jpg]]

Source: Lecture Notes: [http://hal.iwr.uni-heidelberg.de/lehre/pcc-05/lecture5.pdf Course on Parallel Computing by Peter Bastian]

It can be concluded from the graph that not all algorithms will produce high speedups; it depends on what is being run. Scalability is thus dependent on the application, and hence the application area needs to be specified when scaling up a system.
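
The speedup formula S = 1 / (q + (1 – q)/P) is easy to tabulate. The Python sketch below is a minimal illustration; the function name and the chosen values of q and P are assumptions, not the data behind the plot above.

```python
def amdahl_speedup(q, p):
    """Speedup on p processors when a fraction q of the work is sequential."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (q + (1.0 - q) / p)


if __name__ == "__main__":
    for q in (0.01, 0.1, 0.5):
        row = ", ".join(f"P={p}: {amdahl_speedup(q, p):.2f}"
                        for p in (2, 16, 128, 1024))
        print(f"q={q}: {row}  (limit 1/q = {1 / q:.0f})")
```

However many processors are added, the speedup never exceeds 1/q, so a program that is 10% sequential cannot run more than 10 times faster.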

== Bibliography ==
1.[http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J.P. Singh with Anoop Gupta ]

2.[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy , David A. Patterson]

3.[http://www.cs.princeton.edu/~jps/ Parallel Computer Architecture Lecture notes By Jaswinder Pal Singh]

4.[http://www.microsoft.com/technet/solutionaccelerators/cits/interopmigration/unix/hpcunxwn/ch01hpc.mspx Microsoft Solution: High Performance Computing]

5.Lecture Notes: [http://hal.iwr.uni-heidelberg.de/lehre/pcc-05/lecture5.pdf Course on Parallel Computing by Peter Bastian]

6.[http://www.research.ibm.com/journal/rd/492/gara.html IBM Blue Gene/L]

CSC/ECE 506 Fall 2007/wiki1 12 dp3

2007-09-11T03:57:00Z

Dtiwari2: /* Scalability */

Sections 1.3.3 and 1.3.4: Most changes here are probably related to performance metrics. Cite other models for measuring artifacts such as data-transfer time, overhead, occupancy, and communication cost. Focus on the models that are most useful in practice.

== Communication and Replication ==

In this section, we describe two terms, communication and [http://en.wikipedia.org/wiki/Replication_(computer_science) replication], and draw the distinction between them.

Communication between any two [http://en.wikipedia.org/wiki/Process_(computing) processes] is said to occur when data written by one process is read by a second process, which causes a data transfer between the processes. However, if the data is simply stored at one process (because it was initially placed there, or was too large to fit anywhere else) and the transfer only makes another copy of the data at the second process, then it is called replication.
For example, when the processor requests data and we copy something from main memory into the cache, this operation is replication of data. In contrast, if data is produced by a sender process and transferred to a receiver process by message passing, it is an example of communication.

Communication and replication both involve data transfer, which can be defined as the transfer of data across different memory locations. For interprocess(or) communication, the data is transferred across the memory local to the communicating processors or from a remote storage device. When a miss occurs in the cache, the data is transferred from memory to the cache. When cache content that was created by replication is updated or changed, these changes must be propagated to all the other hidden replicas. This is another aspect of data transfer.

== Performance ==

=== Introduction ===

In this section, we briefly discuss the importance of performance measurement in parallel computer architecture and the basic performance metrics.

As we already know, performance measurement is one of the fundamental issues in a [http://en.wikipedia.org/wiki/Uniprocessor uniprocessor system], where architects focus on improving performance by reducing the execution time of standard programs called benchmarks. They use several techniques, such as minimizing memory access time and designing hardware that can execute many instructions in parallel and possibly faster ([http://en.wikipedia.org/wiki/Instruction_level_parallelism micro level parallelism] extraction). Performance measurement is an even more serious concern in parallel computing because, apart from measuring computing performance, we also need to analyze communication cost: data is shared among many processors, and processes (possibly on different processors) need to communicate efficiently, coherently and correctly.

To make our point more precise, let us consider the following example:

:Assume we want to run a program which takes 100 sec on a uniprocessor. We also know that the full program can be decomposed into many processes, and that these processes can be run on different machines. In the best case, we expect a speedup of <tt>n</tt>, where <tt>n</tt> is the number of processors available. We have divided the computing load, but these processes cannot run to completion independently. To run the program correctly, the processes need to communicate with each other for data sharing, [http://en.wikipedia.org/wiki/Synchronization synchronization] etc. Hence, there is communication overhead involved in parallel computing.

The [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:P02.jpg following figure] ([http://www.cs.berkeley.edu/~culler/cs258-s99/ reference]) helps in understanding how parallel processors typically spend their execution time on different activities: local/remote data access, computing, synchronization and other work.

[[Image: P02.jpg| P02.jpg]]

The sequential program takes 100 sec to run. In this hypothetical example it is assumed that 80% of the time is busy-useful time (i.e. execution of instructions), while the processor spends the remaining 20% of the time accessing local data.
Four parallel processors solve the same problem in 55 seconds (a speedup of 1.8 instead of the expected speedup of 4). The parallel processors spend time accessing data at both remote and local locations. They execute instructions, which we call busy-useful time, and they also synchronize with each other to execute the program correctly. However, in such a parallel computing environment, processors also execute some instructions that would not be needed if the program were run sequentially; the time spent in such activities is called busy-overhead time. This type of work is also called 'extra work', which we will discuss shortly.

Clearly, a wise architect would not want a parallel system in which communication overheads overwhelm the speedup achieved by dividing the computing load. In other words, dividing the computing load among different processors is a good idea only when communication costs do not rise too much. Similarly, one should not sacrifice the inherent parallelism available in the program in order to reduce communication cost.

The [http://en.wikipedia.org/wiki/Speedup speedup] ([http://www.cs.berkeley.edu/~culler/cs258-s99/ reference]) gained by parallel computing has to take into account the synchronization time, communication cost and extra work, apart from the computing work. So, speedup can be expressed as follows:

[[Image: P01.jpg| P01.jpg]]

In the equation above, 'extra work' is work done by processors other than computation, synchronization and communication. This might include:
:Computing a good partition for a particular problem.
:Using redundant computation to avoid communication.
:Task, data and process management overhead etc.

Hence, it is obvious that in order to increase the speed up (improve performance), architects would focus on all the factors appearing in the denominator of above equation. Therefore, we need to consider various design trade-offs while analyzing performance of parallel computing architecture.
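
The equation itself appears only in the image and is not reproduced in the text. The Python sketch below therefore assumes the form commonly given in the referenced lecture material: the sequential work divided by the maximum, over all processors, of (work + synchronization wait time + communication cost + extra work). The function name and the per-processor numbers are hypothetical, chosen so that the slowest processor takes 55 seconds as in the example above.

```python
def speedup_bound(sequential_work, per_processor_costs):
    """Upper bound on speedup.

    per_processor_costs holds one tuple per processor:
    (work, sync_wait, comm_cost, extra_work), all in the same time unit.
    Assumed form: sequential work / max over processors of the summed costs.
    """
    slowest = max(sum(costs) for costs in per_processor_costs)
    return sequential_work / slowest


if __name__ == "__main__":
    # Hypothetical breakdown for four processors (seconds).
    costs = [
        (25, 10, 15, 5),
        (25, 5, 10, 5),
        (25, 8, 12, 5),
        (25, 5, 20, 5),
    ]
    print(round(speedup_bound(100, costs), 2))
```

Under this assumed form, reducing any one of the four terms on the slowest processor raises the bound, which is why all of them matter to the architect.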

There are three basic performance metrics.
* [http://en.wikipedia.org/wiki/Latency_(engineering) Latency] : Time taken by an operation to get completed. ( measured as seconds per operation)
* [http://en.wikipedia.org/wiki/Bandwidth Bandwidth]: The rate at which operations are executed. (measured as operations per second)
* Cost: Cost is basically impact of operations on total execution time of the program. (measured as latency times number of operations)

In a uniprocessor system, bandwidth is simply the reciprocal of latency. However, in parallel computing many operations are performed concurrently, so the relationship among the performance metrics is not as simple. In parallel computing, we need to consider the performance of communication operations along with computing operations.

We list three [http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Artifacts_of_Measuring_Performance artifacts of measuring performance] and since data transfer operations are the most frequent type of communication operation, discussion on the same appears first.

=== Artifacts of Measuring Performance ===

*[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Data_Transfer Data Transfer]
*[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Overhead_and_Occupancy Overhead and Occupancy]
*[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Communication_Cost Communication Cost]

==== Data Transfer ====

For any data transfer we would like to estimate the time it consumes, so that we can improve the overall performance of the system by reducing data transfer time. To estimate the data transfer time a simple ''[http://en.wikipedia.org/wiki/Linear_model linear model]'' is used ( [http://www.cs.berkeley.edu/~culler/cs258-s99/#lectures referenced lecture material]):

<tt><center> '''Total Transfer Time (T) = Start-up Time (T0) + Data Transfer Time (TD)'''</center> </tt>

Total transfer time has two components:
1. A constant term (T0), called the start-up cost. We will return to this shortly with more details.
2. The data transfer time, which is estimated as follows:

<tt><center>'''Data Transfer Time (TD) = Size of Data (n) / Bandwidth (B)'''</center></tt>

Bandwidth (B) is also called data transfer rate.

To have better understanding of the model, we should be clear about the following points:

* If we have only one pair of hosts, the data transfer rate is simply the bandwidth of the link connecting those hosts.
* However, if there are many hosts between the source host and the destination host, the bottleneck is the link with the lowest bandwidth.
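
The second point can be stated in one line of code: the data transfer rate of a multi-hop path is the minimum of its link bandwidths. The following Python sketch uses a hypothetical function name and made-up link speeds.

```python
def path_bandwidth(link_bandwidths_mb_per_s):
    """Data transfer rate of a path: the bandwidth of its slowest link."""
    return min(link_bandwidths_mb_per_s)


if __name__ == "__main__":
    # Assumed link bandwidths (MB/s) along a three-hop path.
    print(path_bandwidth([100, 20, 40]))
```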

An important point to note is that the '''achievable bandwidth''' depends on the transfer size, which is why the bandwidth (<tt>'''B'''</tt>) is sometimes called the '''peak bandwidth'''. For example:

::Suppose two hosts are connected by a link with a bandwidth of 20 MB/s, and the start-up cost of communication is 2 microseconds. To transfer an image of size 40 MB, the total transfer time is 2 seconds plus 2 microseconds. Given the peak bandwidth of 20 MB/s, one might have expected the transfer to complete in exactly 2 seconds, but the start-up cost prevents this. As the amount of data increases, the achievable bandwidth approaches the asymptotic rate (B); the start-up cost determines how quickly that asymptotic rate is approached.

As a special case, the amount of data required to achieve half of the peak bandwidth (<tt>'''B'''</tt>) is equal to <tt>'''T0 X B'''</tt>. This is also called the '''half-power point'''. Please note that the printed version of the [http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 text-book] has an erroneous formula for calculating the half-power point.
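
The linear model, the achievable bandwidth and the half-power point can be checked numerically. The Python sketch below reuses the figures from the example above (20 MB/s, 2 microseconds, 40 MB); the function names are hypothetical, and 1 MB is assumed to mean 1,000,000 bytes.

```python
MB = 1_000_000  # assumption: 1 MB = 10**6 bytes


def transfer_time(n_bytes, startup_s, bandwidth_bytes_per_s):
    """Linear model: T = T0 + n / B."""
    return startup_s + n_bytes / bandwidth_bytes_per_s


def achieved_bandwidth(n_bytes, startup_s, bandwidth_bytes_per_s):
    """Bandwidth actually delivered for a transfer of n bytes."""
    return n_bytes / transfer_time(n_bytes, startup_s, bandwidth_bytes_per_s)


def half_power_point(startup_s, bandwidth_bytes_per_s):
    """Transfer size at which half of the peak bandwidth is achieved: T0 x B."""
    return startup_s * bandwidth_bytes_per_s


if __name__ == "__main__":
    T0, B = 2e-6, 20 * MB
    print(transfer_time(40 * MB, T0, B))        # 2 s plus 2 microseconds
    n_half = half_power_point(T0, B)            # about 40 bytes
    print(n_half, achieved_bandwidth(n_half, T0, B) / B)
    for n in (40, 4_000, 400_000, 40 * MB):
        print(n, achieved_bandwidth(n, T0, B) / MB, "MB/s")
```

Setting n / (T0 + n / B) equal to B / 2 and solving for n gives n = T0 x B, which is the half-power point stated above; with these numbers it is only about 40 bytes, so the start-up cost matters mainly for very small transfers on this link.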

Now we discuss the first part of the total transfer time, i.e. the start-up cost. Notice that <tt>'''T0'''</tt> is a constant term for a particular data transfer, but it varies as we consider data transfer over different entities. ''For example'', in a memory operation the start-up cost is the memory access time. In message passing, the start-up cost can be estimated as the time taken by the first bit to reach the destination. For pipelined operations, the start-up cost is simply the time taken to fill the pipeline. For bus transactions, it is the arbitration and command phases.

As parallel computing has advanced, one of the major focuses has been to ''reduce start-up cost''. There are many ways to do so; we describe a few of them here. As stated earlier, the start-up cost for memory operations is basically the memory access time. To reduce memory access time, architects have introduced a costly (hence small) but fast storage area called a [http://en.wikipedia.org/wiki/Cache cache]. By exploiting [http://en.wikipedia.org/wiki/Memory_locality spatial and temporal locality], the cache is filled with useful items, and hence the processor does not have to go to memory (long [http://en.wikipedia.org/wiki/Memory_latency latency]) every time it needs data.
The average access time is governed by the following formula [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]:

<tt><center> '''Average memory access time = Hit time for the cache + (Miss Rate X Miss Penalty)'''</center></tt>

Therefore architects often try to reduce all three components by adopting [http://citeseer.ist.psu.edu/kowarschik03overview.html different cache optimizations] such as multilevel caches, larger block sizes, higher associativity, etc.

We quote the access times of cache and main memory to see how beneficial a cache can be if we manage to get considerably high hit rates.
Cache access time is typically 0.5-25 ns, while for main memory it is 50-250 ns, so such a memory hierarchy can decrease the start-up cost considerably. Bandwidth for caches is around 5,000-20,000 MB/sec, but for memory it is as low as 2,500-10,000 MB/sec. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]

Similarly, for ''bus transactions'' the start-up cost is the time spent during the arbitration and command phases. Suppose that on a 3.6 GHz Intel Xeon machine (year 2004) it takes 3 cycles to arbitrate the bus and present the address; the start-up cost is then around 0.83 nanoseconds. Assuming that around 1994-95 it took the same 3 cycles on a 0.3 GHz Alpha 21064A processor (10 nanoseconds), we can see that the start-up cost has been reduced by more than 10 times. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Chapter 1]

''Pipelining'' is another way to reduce the data transfer time. For pipelined systems, the time to fill the pipeline is the total start-up cost. Although introducing a pipeline seems to add extra start-up cost, more importantly it allows multiple operations to take place concurrently, which helps in achieving higher bandwidth. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix A]

Calculating the start-up cost is often challenging. As technology advances, the focus is to reduce the start-up cost and increase bandwidth.

[http://pg.ece.ncsu.edu/mediawiki/images/3/37/P03.jpg Next plot] (on log-log scale) shows the time for a message passing operation on several machines as a function of message size [http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 source]. The start-up costs for different machines vary a lot (spread over nearly an order of magnitude), and for small amounts of data the total transfer time is a non-linear function of message size, contrary to the linear data transfer model. For large data transfers we get an almost linear relationship, from which the bandwidth can be estimated.

[[Image:P03.jpg | Time for message passing operation versus message size ]]

Quoting other figures, the '''iPSC/2''' machine has a start-up cost of 700 microseconds, while the '''CRAY T3D (PVM)''' machine has a start-up cost of 21 microseconds: within one decade the start-up cost has dropped by more than an order of magnitude. The '''NOW''' machine has a start-up cost of just 16 microseconds. These improvements are essentially due to improvements in cycle time.

Similar trends can be observed in the maximum bandwidth achievable on each machine for data transfer. For the '''nCUBE/2''' machine the maximum bandwidth was 2 MB/s, but relatively advanced machines like the '''SP2''' have a bandwidth of 40 MB/s.

: However, this data transfer model has a few shortcomings too.
This model does not indicate when the next operation can be performed. Estimating the time interval between two operations is particularly useful because bandwidth depends on how frequently operations can be initiated.
This model also does not tell whether other useful work can be done during the transfer.
This model is easy to understand but not very suitable for architectural evaluation. For network transactions, the total message time is difficult to measure unless there is a global clock, as the send and receive usually happen on different processors. So, transaction time is usually measured with an echo test (i.e. one processor sends the data and waits until it receives a response). But this is reliable only if the receive is posted before the message arrives; hence measuring transaction time is very challenging (and not always accurate) in this transfer model.

::In this section we discussed different aspects of the data transfer model. In a parallel computing environment, data transfers usually take place across the network and are invoked by the processor through the communication assist. Therefore, we now need to look at how communication costs are estimated and which factors are important to consider.

==== Overhead and Occupancy ====

Processor execution time has three components: ''Computation Time'', ''Idle Time'', and ''Communication Time''. Communication time is the time the processor spends exchanging messages with another process(or). Communication can be interprocessor, where the two communicating tasks are handled by two different processors, or intraprocessor, where the communicating tasks are handled by the same processor. Generally, intraprocessor and interprocessor communication cost the same, unless the former is highly optimized.

Communication time is a function of the number of bytes (n) transferred. It can be given as below:

<tt><center> '''T(n) = Overhead + Occupancy + Network Delay + Message Size / Bandwidth + Delay due to Contention'''</center></tt>

[[Image:time.jpg]]

Source [http://www.cs.berkeley.edu/~culler/cs258-s99/ Lecture :Culler]

Communication overhead includes the time spent on:
*Creating messages
*Executing communication protocols
*Physically sending messages
*Running through the protocol sets and decoding the message on the receiving node.

During this period the processor cannot do any useful computational work. Parallel programs running on different processors need to coordinate their work among themselves. This increases the rate of interprocessor communication, which in turn increases the net overhead cost.

Occupancy is the time spent at the slowest component in the communication assist and it affects performance in couple of ways. It delays the current request and indirectly contributes to the delays of subsequent requests. The occupancy gets to set the upper limit on how frequently communication operation can be initiated by the processor.

''Some of the recent trends/designs helped reduce these communication costs. [http://en.wikipedia.org/wiki/Blue_Gene IBM Blue Gene (L)] uses [http://www-03.ibm.com/industries/education/doc/content/bin/WhitepaperIBMBlueGenev7.0.pdf Global Collective Network], which carries out operations within the network itself. This saves the processors time to decode messages with intermediate values, calculate new intermediate values, create new messages, and send them on to other nodes. The overhead now is primarily because of communication protocol. It also has a dedicated communications network, [http://www-03.ibm.com/industries/education/doc/content/bin/WhitepaperIBMBlueGenev7.0.pdf Global Barrier and Interrupt Network], to speed up task-to-task coordination activities. IBM BG/L also employs [http://www-03.ibm.com/industries/education/doc/content/bin/WhitepaperIBMBlueGenev7.0.pdf Torus Network], which results in linear growth of the path length while nodes (processors) scale as a cube. Torus Network also gives the ability to send messages in either direction, something like a ring network and hence reduces the distance between furthest points to half. This in turn reduces the network delays.''

[[Image:torus1.jpg]][[Image:global.jpg]][[Image:giga.jpg]]

(a) Three-dimensional Torus. (b) Global Collective Network. (c) IBM BG/L Control Network & Gigabit Ethernet networks

Source [http://www.research.ibm.com/journal/rd/492/gara.pdf IBM BG/L]

''IBM Blue Gene employs a simultaneous send/receive technique in the torus network. In an N-dimensional torus each node has 2N links, so a single node can send to or receive from its 2N neighbours simultaneously (six in the three-dimensional torus of BG/L).''

[[Image:torus.jpg]]

''If the cost of a single send is Ts (without simultaneous send), then for S simultaneous sends the total cost becomes''
<center>Ts + Ts x (S – 1) x f, where 0 < f < 1 </center>
''For large S the speedup is approximately 1/f. Below is the performance comparison of an algorithm (used for image processing in a microscopy application) running on Blue Gene/L versus an Intel Linux cluster that does not employ simultaneous send.''

[[Image:bll.jpg]]
*source: [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1924514 BMC Cell Biology]

''For the Linux cluster the computation time is a very high percentage (up to 40%) of the total time. On Blue Gene it is as little as 5% of the total runtime.''
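
The simultaneous-send cost expression above can be explored with a short sketch. The function names are hypothetical, and the baseline of S back-to-back sends costing S x Ts is our assumption (the text only states that the speedup is approximately 1/f).

```python
def sequential_cost(ts, s):
    """Cost of s sends issued one after another (assumed baseline: s * ts)."""
    return ts * s


def simultaneous_cost(ts, s, f):
    """Cost of s simultaneous sends: Ts + Ts * (S - 1) * f, with 0 < f < 1."""
    if not 0 < f < 1:
        raise ValueError("f must lie strictly between 0 and 1")
    return ts + ts * (s - 1) * f


def send_speedup(ts, s, f):
    """Ratio of the sequential baseline to the simultaneous-send cost."""
    return sequential_cost(ts, s) / simultaneous_cost(ts, s, f)


if __name__ == "__main__":
    # Illustrative values only: Ts = 1 time unit, f = 0.25.
    for s in (2, 6, 100, 10_000):
        print(s, round(send_speedup(1.0, s, 0.25), 3))
```

Under this assumed baseline the ratio is S / (1 + (S – 1) x f), which tends to 1/f as S grows; for a small number of links the gain is noticeably lower than 1/f.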

From the processor’s point of view there are a number of other network delays that can be categorized as occupancy. Contention for resources can be viewed as one of these, and it reduces the net bandwidth: if P processors concurrently use a network of bandwidth B, the effective bandwidth available to each is B/P. There are two types of contention. When it arises at routers and switches it is called network contention; when it is observed at endpoints or processing nodes it is called endpoint contention. When endpoint contention occurs, the processing nodes involved are called hot spots. This type of contention can often be alleviated in software.

''The global interrupt and barrier network and the global tree network operate in parallel, providing global asynchronous sideband signals. This results in a low round-trip latency, as low as 1.3 microseconds. Network contention can still increase the latency. IBM BG/L uses the virtual cut-through (VCT) routing technique.''

==== Communication Cost ====

Ultimately, we want to reduce the communication cost, which is given by the following equation:

<tt><center>'''Communication cost = Frequency of communication x (Communication Time – Overlap)'''</center></tt>

The frequency of communication depends on the machine architecture and the program design. Some architectures, such as scale-up symmetric multiprocessing (SMP) and scale-out massively parallel processing (MPP) [http://www.microsoft.com/technet/solutionaccelerators/cits/interopmigration/unix/hpcunxwn/ch01hpc.mspx (Microsoft Solution: HPC)] systems, support tightly coupled parallel applications. This results in a high frequency of communication, which makes it important to keep the other components of communication time, such as overhead and network delay, small. Loosely coupled parallel applications, on inherently parallel system architectures, require minimal interprocess communication.

The overlap is the portion of the communication operation that is performed concurrently with other useful work by the processor (computation or other communication). This concept is exploited to obtain high throughput. For instance, each node of IBM BG/L has an IBM CMOS ASIC containing two independent cores (microprocessors). The cores are virtually identical: each can handle its own communication, or one can be used for communication and the other for computation. In this way a very high degree of overlap is achieved.

=== Scalability ===

Scalability is such an important performance metric for parallel computers that it deserves a heading of its own. A common perception is that performance keeps increasing as the number of processors is increased arbitrarily. This is not always true. Scalability means that there exists an isoefficiency function for a parallel system such that the efficiency remains the same when the problem size is increased along with the number of processors. Scalability is bounded by two limits. Weak scaling: the load on each processor remains the same while the number of processors is increased. Strong scaling: the problem size remains the same, so the load on each processor is reduced as the processor count increases. In general, most problems lie between these two limits.

Is there a limit to the useful number of processors? Amdahl's law gives a picture of how performance is affected by increasing the number of processors. If the problem size is fixed and the program takes execution time T on a uniprocessor system, then on a parallel system with P processors it will take
<center>T x q + (1 – q) x T / P</center>

where q is the sequential fraction of the program, so that T x q is the time taken by its sequential part. The speedup is then
<center>S = 1 / (q + (1 – q)/P)</center>
Evaluating this expression gives the following result:

[[Image:amdhals.jpg]]

Source: Lecture Notes: [http://hal.iwr.uni-heidelberg.de/lehre/pcc-05/lecture5.pdf Course on Parallel Computing by Peter Bastian]

It can be concluded from the graph that not all algorithms produce high speedups; it depends on what we are running. Scalability thus depends on the application, and hence the application area needs to be specified when scaling up a system.
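
Amdahl's formula above is easy to evaluate directly. The following is a minimal sketch (the function name and the sample values of q and P are ours, chosen only for illustration) that tabulates the speedup for a few sequential fractions.

```python
def amdahl_speedup(q, p):
    """Speedup S = 1 / (q + (1 - q) / P) for sequential fraction q on P processors."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a fraction between 0 and 1")
    if p < 1:
        raise ValueError("P must be at least 1")
    return 1.0 / (q + (1.0 - q) / p)


if __name__ == "__main__":
    # Sample sequential fractions and processor counts (illustrative only).
    for q in (0.0, 0.01, 0.1, 0.5):
        row = [round(amdahl_speedup(q, p), 2) for p in (1, 4, 16, 64, 1024)]
        print(f"q={q}: {row}")
```

With q = 0.1, ten processors give a speedup of only about 5.3, and no number of processors can push the speedup beyond 1/q = 10.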

== Bibliography ==
1.[http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J.P. Singh with Anoop Gupta ]

2.[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy , David A. Patterson]

3.[http://www.cs.princeton.edu/~jps/ Parallel Computer Architecture Lecture notes By Jaswinder Pal Singh]

4.[http://www.microsoft.com/technet/solutionaccelerators/cits/interopmigration/unix/hpcunxwn/ch01hpc.mspx Microsoft Solution: High Performance Computing]

5.Lecture Notes: [http://hal.iwr.uni-heidelberg.de/lehre/pcc-05/lecture5.pdf Course on Parallel Computing by Peter Bastian]

6.[http://www.research.ibm.com/journal/rd/492/gara.html IBM Blue Gene/L]

CSC/ECE 506 Fall 2007/wiki1 12 dp3

2007-09-11T03:54:43Z

Dtiwari2: /* Bibiliography */

Sections 1.3.3 and 1.3.4: Most changes here are probably related to performance metrics. Cite other models for measuring artifacts such as data-transfer time, overhead, occupancy, and communication cost. Focus on the models that are most useful in practice.

== Communication and Replication ==

In this section, we describe two terms, Communication and [http://en.wikipedia.org/wiki/Replication_(computer_science) Replication], and draw the distinction between them.

Communication between two [http://en.wikipedia.org/wiki/Process_(computing) processes] is said to occur when data written by one process is read by a second process, which causes a data transfer between the processes. However, if the data is merely stored at one process (because it was initially placed there or was too large to fit anywhere else) and the transfer only makes another copy of it at the second process, then it is called replication.
For example, when data is copied from main memory into the cache at the processor’s request, the operation is replication of data. In contrast, if data is produced by a sender process and transferred to a receiver process by message passing, it is an example of communication.

Communication and replication both involve data transfer, which can be defined as the transfer of data across different memory locations. For interprocess(or) communication the data is transferred across the memories local to the communicating processors or from a remote storage device. When a miss occurs in the cache, the data is transferred from memory to the cache. When cache contents that exist as a result of replication are updated or changed, these changes must be propagated to all the other hidden replicas. This is another aspect of data transfer.

== Performance ==

=== Introduction ===

In this section, we briefly discuss importance of performance measurement in parallel computer architecture and basic performance metrics.

As we already know, performance measurement is one of the fundamental issues in a [http://en.wikipedia.org/wiki/Uniprocessor uniprocessor system], where architects focus on improving performance by reducing the execution time of standard programs called benchmarks. They use several techniques, such as minimizing memory access time and designing hardware that can execute many instructions in parallel and possibly faster ([http://en.wikipedia.org/wiki/Instruction_level_parallelism micro level parallelism] extraction). Performance measurement is an even more serious concern in parallel computing because, apart from computing performance, we also need to analyze communication cost: data is shared among many processors, and processes (possibly on different processors) need to communicate efficiently, coherently and correctly.

To make our point more precise, let us consider the following example:

:Assume we want to run a program that takes 100 sec on a uniprocessor. We also know that the program can be decomposed into many processes and that these processes can run on different machines. In the best case we therefore expect a speedup of <tt>n</tt>, where <tt>n</tt> is the number of processors available. We have divided the computing load, but these processes cannot run to completion independently. To run the program correctly, they need to communicate with each other for data sharing, [http://en.wikipedia.org/wiki/Synchronization synchronization] etc. Hence, there is communication overhead involved in parallel computing.

The [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:P02.jpg following figure] ([http://www.cs.berkeley.edu/~culler/cs258-s99/ reference]) helps in understanding how parallel processors typically spend their execution time on different activities: local/remote data access, computing, synchronization and other work.

[[Image: P02.jpg| P02.jpg]]

In this hypothetical example a sequential program takes 100 sec to run: 80% of the time is busy-useful time (i.e. execution of instructions), and the remaining 20% is spent accessing local data.
Four parallel processors solve the same problem in 55 seconds (a speedup of about 1.8 instead of the expected speedup of 4). The parallel processors spend time accessing data at both local and remote locations. They execute instructions, which we call busy-useful time, and they synchronize with each other to execute the program correctly. However, in such a parallel computing environment processors also execute some instructions that would not be needed if the program were run sequentially; the time spent on such activities is called busy-overhead time. This type of work is also called 'extra work', which we will discuss shortly.

Clearly, a wise architect would not want a parallel system in which communication overheads overwhelm the speedup achieved by dividing the computing load. In other words, dividing the computing load among different processors is a good idea only when communication costs do not rise too much. Similarly, one should not kill the inherent parallelism available in the program in order to reduce communication cost.

The [http://en.wikipedia.org/wiki/Speedup Speedup] ([http://www.cs.berkeley.edu/~culler/cs258-s99/ reference]) gained by parallel computing has to take into account the synchronization time, communication cost and extra work in addition to the computing work. Speedup can therefore be expressed as follows:

[[Image: P01.jpg| P01.jpg]]

In the equation above, 'extra work' is work done by processors other than computation, synchronization and communication. This might include:
:Computing a good partition for a particular problem.
:Using redundant computation to avoid communication.
:Task, data and process management overhead etc.

Hence, it is obvious that in order to increase the speed up (improve performance), architects would focus on all the factors appearing in the denominator of above equation. Therefore, we need to consider various design trade-offs while analyzing performance of parallel computing architecture.

There are three basic performance metrics.
* [http://en.wikipedia.org/wiki/Latency_(engineering) Latency]: Time taken by an operation to complete (measured as seconds per operation).
* [http://en.wikipedia.org/wiki/Bandwidth Bandwidth]: The rate at which operations are executed (measured as operations per second).
* Cost: The impact of operations on the total execution time of the program (measured as latency times number of operations).

In a uniprocessor system, bandwidth is simply the reciprocal of latency; however, in parallel computing many operations are performed concurrently, so the relationship among the performance metrics is not simple. In parallel computing, we need to consider the performance of communication operations along with computing operations.

We list three [http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Artifacts_of_Measuring_Performance artifacts of measuring performance] and since data transfer operations are the most frequent type of communication operation, discussion on the same appears first.

=== Artifacts of Measuring Performance ===

*[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Data_Transfer Data Transfer]
*[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Overhead_and_Occupancy Overhead and Occupancy]
*[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Communication_Cost Communication Cost]

==== Data Transfer ====

For any data transfer we would like to estimate the time it consumes, so that we can improve the overall performance of the system by reducing data transfer time. To estimate the data transfer time a simple ''[http://en.wikipedia.org/wiki/Linear_model linear model]'' is used ( [http://www.cs.berkeley.edu/~culler/cs258-s99/#lectures referenced lecture material]):

<tt><center> '''Total Transfer Time (T) = Start-up Time (T0) + Data Transfer Time (TD)'''</center> </tt>

Total transfer time has two components:
1. A constant term (T0) which is called start up cost. We will shortly return to this with more details.
2. Data transfer time, which is estimated as following:

<tt><center>'''Data Transfer Time (TD) = Size of Data (n) / Bandwidth (B)'''</center></tt>

Bandwidth (B) is also called data transfer rate.

To have better understanding of the model, we should be clear about the following points:

* If we have only one pair of hosts, the data transfer rate is simply the bandwidth of the link connecting them.
* However, if there are many hosts between the source host and the destination host, the bottleneck is the link with the lowest bandwidth.

An important point to note is that the '''achievable bandwidth''' depends on the transfer size, which is why bandwidth (<tt>'''B'''</tt>) is sometimes called '''peak bandwidth'''. For example:

::Suppose we have two hosts connected by a link with a bandwidth of 20 MB/s and a communication start-up cost of 2 microseconds. If we want to transfer an image of size 40 MB, the total transfer time is 2 seconds plus 2 microseconds. Given the peak bandwidth of 20 MB/s, one might have expected the transfer to complete in 2 seconds, achieving the peak bandwidth, but the start-up cost prevents this. Clearly, as the amount of data increases, the achievable bandwidth approaches the asymptotic rate (B); the start-up cost determines how quickly the asymptotic rate is approached.

As a special case, the amount of data required to achieve half of the peak bandwidth (<tt>'''B'''</tt>) is equal to <tt>'''T0 x B'''</tt>. This is called the '''half-power point'''. Please note that the printed version of the [http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 text-book] has an erroneous formula for the half-power point.
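
The linear model and the half-power point can be checked numerically. The sketch below uses hypothetical function names and assumes 1 MB = 10^6 bytes; it reproduces the 40 MB example and shows that a transfer of T0 x B bytes achieves exactly half of the peak bandwidth.

```python
def transfer_time(n_bytes, t0, bandwidth):
    """Linear model: T = T0 + n / B (seconds)."""
    return t0 + n_bytes / bandwidth


def achieved_bandwidth(n_bytes, t0, bandwidth):
    """Bandwidth actually delivered for an n-byte transfer: n / T."""
    return n_bytes / transfer_time(n_bytes, t0, bandwidth)


def half_power_point(t0, bandwidth):
    """Transfer size at which the achieved bandwidth equals B / 2."""
    return t0 * bandwidth


if __name__ == "__main__":
    B = 20e6    # 20 MB/s, assuming 1 MB = 10**6 bytes
    T0 = 2e-6   # 2 microseconds start-up cost
    print("40 MB transfer time (s):", transfer_time(40e6, T0, B))
    n_half = half_power_point(T0, B)
    print("half-power point (bytes):", n_half)
    print("achieved bandwidth there (B/s):", achieved_bandwidth(n_half, T0, B))
```

The half-power point follows from setting n / (T0 + n/B) = B/2 and solving for n, which gives n = T0 x B. For the example link this is only 40 bytes, so the start-up cost matters mainly for very small messages.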

We now discuss the first part of the total transfer time, i.e. the start-up cost. Notice that <tt>'''T0'''</tt> is a constant term for a particular data transfer, but it varies across different kinds of data transfer. ''For example'', in a memory operation the start-up cost is the memory access time. In message passing the start-up cost can be estimated as the time taken by the first bit to reach the destination. For pipelined operations, the start-up cost is simply the time taken to fill the pipeline. For bus transactions it is the arbitration and command phases.

As parallel computing has advanced, one of the major focuses has been to ''reduce start up cost''. There are many ways to do so; we describe a few of them here. As stated earlier, the start-up cost for memory operations is basically the memory access time. To reduce memory access time, architects have introduced a costly (hence small) but fast storage area called [http://en.wikipedia.org/wiki/Cache cache]. Exploiting [http://en.wikipedia.org/wiki/Memory_locality spatial and temporal locality], the cache is filled with useful items, and hence the processor does not have to go to memory (long [http://en.wikipedia.org/wiki/Memory_latency latency]) every time it needs data.
The average access time is governed by the following formula ([http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]):

<tt><center> '''Average memory access time = Hit time for the cache + (Miss Rate X Miss Penalty)'''</center></tt>

Architects therefore try to reduce all three components by adopting [http://citeseer.ist.psu.edu/kowarschik03overview.html different cache optimizations] such as multilevel caches, larger block sizes and higher associativity.

We quote the access times of cache and main memory to see how beneficial a cache can be if we manage to get considerably high hit rates.
Cache access time is typically 0.5-25 ns, while for main memory it is 50-250 ns, so we can decrease the start-up cost considerably by having such a memory hierarchy. Bandwidth for caches ranges around 5,000-20,000 MB/sec, but for memory it is as low as 2,500-10,000 MB/sec. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]
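
To see the effect of the hit rate, the average memory access time formula can be evaluated for a few miss rates. This is a minimal sketch: the 1 ns hit time and 100 ns miss penalty are picked from within the ranges quoted above, and the miss rates are assumptions made purely for illustration.

```python
def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty (all times in the same unit)."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


if __name__ == "__main__":
    HIT_TIME_NS = 1.0        # assumed cache hit time
    MISS_PENALTY_NS = 100.0  # assumed main-memory penalty
    for miss_rate in (0.01, 0.05, 0.20):
        amat = average_memory_access_time(HIT_TIME_NS, miss_rate, MISS_PENALTY_NS)
        print(f"miss rate {miss_rate:.0%}: AMAT = {amat:.1f} ns")
```

Even a 5% miss rate keeps the average access time at 6 ns under these assumptions, far below the assumed 100 ns memory access.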

Similarly, for ''bus transactions'' the start-up cost is the time spent in the arbitration and command phases. Suppose that on a 3.6 GHz Intel Xeon machine (year 2004) it takes 3 cycles to arbitrate the bus and present the address; the start-up cost is then around 0.83 nanoseconds. Assuming that around 1994-95 the same 3 cycles were needed on a 0.3 GHz Alpha 21064A processor (10 nanoseconds), the start-up cost has been reduced by more than 10 times. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Chapter 1]
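
The arithmetic behind this comparison is a simple cycles-to-time conversion. A small sketch follows (the function name is ours; the 3-cycle figure is the assumption already made in the text).

```python
def startup_cost_ns(cycles, clock_ghz):
    """Time in nanoseconds for a given number of cycles at a clock rate in GHz."""
    if clock_ghz <= 0:
        raise ValueError("clock rate must be positive")
    return cycles / clock_ghz


if __name__ == "__main__":
    xeon = startup_cost_ns(3, 3.6)    # 3.6 GHz Intel Xeon, 2004
    alpha = startup_cost_ns(3, 0.3)   # 0.3 GHz Alpha 21064A, 1994-95
    print(f"Xeon: {xeon:.2f} ns, Alpha: {alpha:.2f} ns, ratio: {alpha / xeon:.1f}x")
```

Because the cycle count is held fixed, the improvement equals the ratio of the clock rates, 3.6 / 0.3 = 12.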

''Pipelining'' is another way to reduce the data transfer time. For pipelined systems, the time to fill the pipeline is the total start-up cost. Although introducing a pipeline seems to add extra start-up cost, it allows multiple operations to take place concurrently, which helps in achieving higher bandwidth. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix A]

Calculating the start-up cost is often challenging. As technology advances, the focus is on reducing the start-up cost and increasing bandwidth.

The [http://pg.ece.ncsu.edu/mediawiki/images/3/37/P03.jpg next plot] (on a log-log scale) shows the time for a message passing operation on several machines as a function of message size ([http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 source]). The start-up costs of the different machines vary a lot (spread over nearly an order of magnitude). For small amounts of data the total transfer time is a non-linear function of message size, contrary to the linear data transfer model; for large transfers the relationship is almost linear. The bandwidth can be calculated from the slope of the line.

[[Image:P03.jpg | Time for message passing operation versus message size ]]

Quoting other figures:

The '''iPSC/2''' machine has a start-up cost of 700 microseconds and the '''CRAY T3D (PVM)''' machine has a start-up cost of 21 microseconds, so within one decade the start-up cost dropped by more than an order of magnitude. The '''NOW''' machine has a start-up cost of just 16 microseconds. These improvements are essentially due to improvements in cycle time.

Similar trends can be observed in the maximum bandwidth achievable on each machine for data transfer. For the '''nCUBE/2''' machine the maximum bandwidth was 2 MB/s, whereas relatively advanced machines like the '''SP2''' have a bandwidth of 40 MB/s.

: However, this data transfer model has a few shortcomings too.
* This model does not indicate when the next operation can be performed. Estimating the time interval between two operations is particularly useful because bandwidth depends on how frequently operations can be initiated.
* This model does not tell whether other useful work can be done during the transfer.
* This model is easy to understand, but it is not well suited to architectural evaluation. For network transactions, the total message time is difficult to measure without a global clock, because the send and the receive usually happen on different processors. Transaction time is therefore usually measured with an echo test (one processor sends the data and waits until it receives a reply). This is reliable only if the receive is posted before the message arrives, so measuring transaction time is challenging (and not always accurate) under this transfer model.
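
The echo test mentioned above can be sketched on a single machine. This is an illustration only, not a benchmark of any system discussed here: it uses a local socket pair and a thread in place of two processors, all names are hypothetical, and halving the round-trip time to estimate the one-way time assumes the two directions cost the same.

```python
import socket
import threading
import time


def recv_exact(sock, size):
    """Receive exactly `size` bytes from the socket."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def echo_server(sock, n_messages, size):
    """Echo side: return every received message unchanged."""
    for _ in range(n_messages):
        sock.sendall(recv_exact(sock, size))


def echo_test(size=64, n_messages=100):
    """Return the estimated one-way time (seconds) for messages of `size` bytes."""
    a, b = socket.socketpair()
    server = threading.Thread(target=echo_server, args=(b, n_messages, size))
    server.start()  # the echo side is waiting before the first send
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(n_messages):
        a.sendall(payload)
        recv_exact(a, size)  # wait for the echo before sending again
    elapsed = time.perf_counter() - start
    server.join()
    a.close()
    b.close()
    return elapsed / n_messages / 2  # half of the mean round-trip time


if __name__ == "__main__":
    for size in (8, 1024, 65536):
        print(size, "bytes:", echo_test(size), "s one-way (estimated)")
```

Only the sender's clock is used, which is why no global clock is needed. Averaging over many round trips reduces timer noise but does not remove the symmetry assumption.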

::In this section we discussed different aspects of the data transfer model. In a parallel computing environment, data transfers usually take place across the network and are invoked by the processor through the communication assist. We therefore now look at how communication costs are estimated and which factors are important to consider.

==== Overhead and Occupancy ====

One of the three components of processor execution time, apart from ''Computation Time and Idle Time'', is ''Communication Time''. Communication time is the time spent by the processor exchanging messages with another process(or). There are two types of communication: interprocessor and intraprocessor. In interprocessor communication the two communicating tasks are handled by two different processors, while in intraprocessor communication they are handled by the same processor. Generally, intraprocessor and interprocessor communication cost the same, provided the former is not highly optimized.

Communication time is a function of the number of bytes (n) transferred. It can be expressed as follows:

<tt><center> '''T(n) = Overhead + Occupancy + Network Delay + Message Size / Bandwidth + Delay due to Contention'''</center></tt>

[[Image:time.jpg]]

Source [http://www.cs.berkeley.edu/~culler/cs258-s99/ Lecture :Culler]

Communication overhead includes the time spent on
*Creating messages
*Executing communication protocols
*Physically sending messages
*Running through the protocol sets and decoding the message on the receiving node.

During this period the processor cannot do any useful computational work. Parallel programs running on different processors need to coordinate their work among themselves. This increases the rate of interprocessor communication, which in turn increases the net overhead cost.

Occupancy is the time spent at the slowest component in the communication assist, and it affects performance in a couple of ways: it delays the current request and indirectly contributes to the delay of subsequent requests. Occupancy sets the upper limit on how frequently communication operations can be initiated by the processor.

''Some of the recent trends/designs helped reduce these communication costs. [http://en.wikipedia.org/wiki/Blue_Gene IBM Blue Gene (L)] uses [http://www-03.ibm.com/industries/education/doc/content/bin/WhitepaperIBMBlueGenev7.0.pdf Global Collective Network], which carries out operations within the network itself. This saves the processors time to decode messages with intermediate values, calculate new intermediate values, create new messages, and send them on to other nodes. The overhead now is primarily because of communication protocol. It also has a dedicated communications network, [http://www-03.ibm.com/industries/education/doc/content/bin/WhitepaperIBMBlueGenev7.0.pdf Global Barrier and Interrupt Network], to speed up task-to-task coordination activities. IBM BG/L also employs [http://www-03.ibm.com/industries/education/doc/content/bin/WhitepaperIBMBlueGenev7.0.pdf Torus Network], which results in linear growth of the path length while nodes (processors) scale as a cube. Torus Network also gives the ability to send messages in either direction, something like a ring network and hence reduces the distance between furthest points to half. This in turn reduces the network delays.''

[[Image:torus1.jpg]][[Image:global.jpg]][[Image:giga.jpg]]

(a) Three-dimensional Torus. (b) Global Collective Network. (c) IBM BG/L Control Network & Gigabit Ethernet networks

Source [http://www.research.ibm.com/journal/rd/492/gara.pdf IBM BG/L]

''IBM Blue Gene employs a simultaneous send/receive technique in the torus network. In an N-dimensional torus each node has 2N links, so a single node can send to or receive from its 2N neighbours simultaneously (six in the three-dimensional torus of BG/L).''

[[Image:torus.jpg]]

''If the cost of a single send is Ts (without simultaneous send), then for S simultaneous sends the total cost becomes''
<center>Ts + Ts x (S – 1) x f, where 0 < f < 1 </center>
''For large S the speedup is approximately 1/f. Below is the performance comparison of an algorithm (used for image processing in a microscopy application) running on Blue Gene/L versus an Intel Linux cluster that does not employ simultaneous send.''

[[Image:bll.jpg]]
*source: [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1924514 BMC Cell Biology]

''For the Linux cluster the computation time is a very high percentage (up to 40%) of the total time. On Blue Gene it is as little as 5% of the total runtime.''

From the processor’s point of view there are a number of other network delays that can be categorized as occupancy. Contention for resources can be viewed as one of these, and it reduces the net bandwidth: if P processors concurrently use a network of bandwidth B, the effective bandwidth available to each is B/P. There are two types of contention. When it arises at routers and switches it is called network contention; when it is observed at endpoints or processing nodes it is called endpoint contention. When endpoint contention occurs, the processing nodes involved are called hot spots. This type of contention can often be alleviated in software.

''The global interrupt and barrier network and the global tree network operate in parallel, providing global asynchronous sideband signals. This results in a low round-trip latency, as low as 1.3 microseconds. Network contention can still increase the latency. IBM BG/L uses the virtual cut-through (VCT) routing technique.''

==== Communication Cost ====

Ultimately, we want to reduce the communication cost, which is given by the following equation:

<tt><center>'''Communication cost = Frequency of communication x (Communication Time – Overlap)'''</center></tt>

The frequency of communication depends on the machine architecture and the program design. Some architectures, such as scale-up symmetric multiprocessing (SMP) and scale-out massively parallel processing (MPP) [http://www.microsoft.com/technet/solutionaccelerators/cits/interopmigration/unix/hpcunxwn/ch01hpc.mspx (Microsoft Solution: HPC)] systems, support tightly coupled parallel applications. This results in a high frequency of communication, which makes it important to keep the other components of communication time, such as overhead and network delay, small. Loosely coupled parallel applications, on inherently parallel system architectures, require minimal interprocess communication.

The overlap is the portion of the communication operation that is performed concurrently with other useful work by the processor (computation or other communication). This concept is exploited to obtain high throughput. For instance, each node of IBM BG/L has an IBM CMOS ASIC containing two independent cores (microprocessors). The cores are virtually identical: each can handle its own communication, or one can be used for communication and the other for computation. In this way a very high degree of overlap is achieved.
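
The two expressions in this part, T(n) and the communication cost, can be combined in a short sketch. All names and numeric values below are hypothetical and chosen only to show how the terms interact; they do not describe any machine mentioned in the text.

```python
def communication_time(n_bytes, overhead, occupancy, network_delay,
                       bandwidth, contention_delay=0.0):
    """T(n) = overhead + occupancy + network delay + n / bandwidth + contention delay."""
    return overhead + occupancy + network_delay + n_bytes / bandwidth + contention_delay


def communication_cost(frequency, comm_time, overlap):
    """Communication cost = frequency * (communication time - overlap)."""
    if overlap > comm_time:
        raise ValueError("overlap cannot exceed the communication time")
    return frequency * (comm_time - overlap)


if __name__ == "__main__":
    # Hypothetical message: 1000 bytes over a 1 GB/s link, times in seconds.
    t = communication_time(1000, overhead=1e-6, occupancy=0.5e-6,
                           network_delay=2e-6, bandwidth=1e9)
    print("T(n) =", t)
    # 1000 messages; compare no overlap with 1.5 microseconds hidden per message.
    print("cost, no overlap:  ", communication_cost(1000, t, 0.0))
    print("cost, with overlap:", communication_cost(1000, t, 1.5e-6))
```

The sketch makes the design trade-off visible: for small messages the fixed terms dominate T(n), so hiding part of them behind computation reduces the total cost more than a faster link would.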

=== Scalability ===

Scalability is such an important performance metric for parallel computers that it deserves a heading of its own. A common perception is that performance keeps increasing as the number of processors is increased arbitrarily. This is not always true. Scalability means that there exists an isoefficiency function for a parallel system such that the efficiency remains the same when the problem size is increased along with the number of processors. Scalability is bounded by two limits. Weak scaling: the load on each processor remains the same while the number of processors is increased. Strong scaling: the problem size remains the same, so the load on each processor is reduced as the processor count increases. In general, most problems lie between these two limits.

Is there a limit to the useful number of processors? Amdahl's law gives a picture of how performance is affected by increasing the number of processors. If the problem size is fixed and the program takes execution time T on a uniprocessor system, then on a parallel system with P processors it will take
<center>T x q + (1 – q) x T / P</center>

where q is the sequential fraction of the program, so that T x q is the time taken by its sequential part. The speedup is then
<center>S = 1 / (q + (1 – q)/P)</center>
Evaluating this expression gives the following result:

[[Image:amdhals.jpg]]

Source: Lecture Notes: [http://hal.iwr.uni-heidelberg.de/lehre/pcc-05/lecture5.pdf Course on Parallel Computing by Peter Bastian]

It can be concluded from the graph that not all algorithms produce high speedups; the achievable speedup depends on what is being run. Scalability is thus application dependent, and the application area needs to be specified when scaling up a system.

== Bibliography ==
1.[http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J.P. Singh with Anoop Gupta ]

2.[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy , David A. Patterson]

3.[http://www.cs.princeton.edu/~jps/ Parallel Computer Architecture Lecture notes By Jaswinder Pal Singh]

4.[http://www.microsoft.com/technet/solutionaccelerators/cits/interopmigration/unix/hpcunxwn/ch01hpc.mspx Microsoft Solution: High Performance Computing]

5.Lecture Notes: [http://hal.iwr.uni-heidelberg.de/lehre/pcc-05/lecture5.pdf Course on Parallel Computing by Peter Bastian]

6.[http://www.research.ibm.com/journal/rd/492/gara.html IBM Blue Gene/L]

CSC/ECE 506 Fall 2007/wiki1 12 dp3

2007-09-11T03:54:08Z

Dtiwari2: /* Data Transfer */

Sections 1.3.3 and 1.3.4: Most changes here are probably related to performance metrics. Cite other models for measuring artifacts such as data-transfer time, overhead, occupancy, and communication cost. Focus on the models that are most useful in practice.

== Communication and Replication ==

In this section we describe two terms, communication and [http://en.wikipedia.org/wiki/Replication_(computer_science) replication], and draw the distinction between them.

Communication between two [http://en.wikipedia.org/wiki/Process_(computing) processes] is said to occur when data written by one process is read by a second process. This causes a data transfer between the processes. If, however, the data is simply stored at one process (because it was initially placed there or was too large to fit anywhere else) and the transfer only makes another copy of it at the second process, the transfer is called replication.
For example, copying data from main memory into the cache in response to a processor request is replication. In contrast, when data produced by a sender process is transferred to a receiver process by message passing, that is communication.

Communication and replication both involve data transfer, that is, the movement of data between memory locations. For interprocess(or) communication the data is transferred between the memories local to the communicating processors, or from a remote storage device. When a cache miss occurs, the data is transferred from memory to the cache. When cache contents created by replication are updated, the changes must be propagated to all the other hidden replicas. This is another aspect of data transfer.

== Performance ==

=== Introduction ===

In this section, we briefly discuss the importance of performance measurement in parallel computer architecture and introduce the basic performance metrics.

Performance measurement is already a fundamental issue in [http://en.wikipedia.org/wiki/Uniprocessor uniprocessor systems], where architects focus on reducing the execution time of standard programs called benchmarks. They use techniques such as minimizing memory access time and designing hardware that can execute many instructions in parallel, and possibly faster ([http://en.wikipedia.org/wiki/Instruction_level_parallelism micro level parallelism] extraction). Performance measurement is an even more serious concern in parallel computing because, in addition to computing performance, we must analyze communication cost: data is shared among many processors, and processes (possibly on different processors) need to communicate efficiently, coherently and correctly.

To make our point more precise, let us consider the following example:

:Assume we want to run a program that takes 100 sec on a uniprocessor. We also know that the program can be decomposed into many processes that can run on different machines. In the best case we therefore expect a speedup of <tt>n</tt>, where <tt>n</tt> is the number of processors available. Although the computing load has been divided, these processes cannot run to completion independently. To run the program correctly they need to communicate with each other for data sharing, [http://en.wikipedia.org/wiki/Synchronization synchronization] and so on. Hence, parallel computing involves communication overhead.

The [http://pg.ece.ncsu.edu/mediawiki/index.php/Image:P02.jpg following figure] ([http://www.cs.berkeley.edu/~culler/cs258-s99/ reference]) shows how parallel processors typically divide their execution time among different activities: local and remote data access, computation, synchronization and other work.

[[Image: P02.jpg| P02.jpg]]

The sequential program takes 100 sec. In this hypothetical example the processor spends 80% of that time as busy-useful time (i.e. executing instructions) and the remaining 20% accessing local data.
Four parallel processors solve the same problem in 55 seconds (a speedup of about 1.8 instead of the expected speedup of 4). The parallel processors spend time accessing both remote and local data. They execute instructions, which is busy-useful time, and they synchronize with each other to execute the program correctly. In such a parallel environment, processors also perform work that would not be needed if the program ran sequentially; the time spent on it is called busy-overhead time. This kind of work is also called 'extra work', which we discuss shortly.

Clearly, a wise architect would not want a parallel system in which communication overheads overwhelm the speedup achieved by dividing the computing load. In other words, dividing the computing load among processors is a good idea only when communication costs do not grow too much. Conversely, one should not destroy the inherent parallelism of the program in order to reduce communication cost.

The [http://en.wikipedia.org/wiki/Speedup speedup] ([http://www.cs.berkeley.edu/~culler/cs258-s99/ reference]) gained by parallel computing has to take into account synchronization time, communication cost and extra work in addition to computing work. Speedup can therefore be expressed as follows:

[[Image: P01.jpg| P01.jpg]]

In the equation above, 'extra work' is work done by processors other than computation, synchronization and communication. This might include:
:Computing a good partition for a particular problem.
:Using redundant computation to avoid communication.
:Task, data and process management overhead etc.

Hence, in order to increase the speedup (improve performance), architects must address every factor appearing in the denominator of the above equation. This means weighing various design trade-offs when analyzing the performance of a parallel computing architecture.

There are three basic performance metrics.
* [http://en.wikipedia.org/wiki/Latency_(engineering) Latency]: The time taken for an operation to complete (measured in seconds per operation).
* [http://en.wikipedia.org/wiki/Bandwidth Bandwidth]: The rate at which operations are executed (measured in operations per second).
* Cost: The impact of the operations on the total execution time of the program (measured as latency times the number of operations).

In a uniprocessor system, bandwidth is simply the reciprocal of latency. In parallel computing, however, many operations are performed concurrently, so the relationship among the performance metrics is not as simple. We also need to consider the performance of communication operations along with computing operations.

We list three [http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Artifacts_of_Measuring_Performance artifacts of measuring performance] and since data transfer operations are the most frequent type of communication operation, discussion on the same appears first.

=== Artifacts of Measuring Performance ===

*[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Data_Transfer Data Transfer]
*[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Overhead_and_Occupancy Overhead and Occupancy]
*[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_12_dp3#Communication_Cost Communication Cost]

==== Data Transfer ====

For any data transfer we would like to estimate the time it consumes, so that we can improve the overall performance of the system by reducing data transfer time. To estimate the data transfer time a simple ''[http://en.wikipedia.org/wiki/Linear_model linear model]'' is used ( [http://www.cs.berkeley.edu/~culler/cs258-s99/#lectures referenced lecture material]):

<tt><center> '''Total Transfer Time (T) = Start-up Time (T0) + Data Transfer Time (TD)'''</center> </tt>

Total transfer time has two components:
* A constant term (T0), called the start-up cost. We return to it in more detail shortly.
* The data transfer time, which is estimated as follows:

<tt><center>'''Data Transfer Time (TD) = Size of Data (n) / Bandwidth (B)'''</center></tt>

Bandwidth (B) is also called data transfer rate.

To have better understanding of the model, we should be clear about the following points:

* If there is only one pair of hosts, the data transfer rate is simply the bandwidth of the link connecting them.
* If there are intermediate hosts between the source and destination hosts, the bottleneck is the link with the lowest bandwidth.

An important point is that the '''achievable bandwidth''' depends on the transfer size, which is why the bandwidth (<tt>'''B'''</tt>) is sometimes called the '''peak bandwidth'''. For example:

::Suppose two hosts are connected by a link with a bandwidth of 20MB/s and the start-up cost of communication is 2 microseconds. Transferring an image of size 40MB then takes 2 seconds plus 2 microseconds in total. Given the peak bandwidth of 20MB/s, one might have expected the transfer to complete in exactly 2 seconds, but the start-up cost prevents the peak bandwidth from being reached. As the amount of data increases, the achievable bandwidth approaches the asymptotic rate B; the start-up cost determines how quickly that asymptotic rate is approached.

As a special case, the amount of data required to achieve half of the peak bandwidth (<tt>'''B'''</tt>) is equal to <tt>'''T0 X B'''</tt>. This is called the '''half-power point'''. Note that the printed version of the [http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 text-book] has an erroneous formula for the half-power point.
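The linear model and the half-power point can be checked numerically. The following is a minimal sketch, not part of the referenced material: the function names are our own, and we assume 1 MB = 10^6 bytes for the example link of 20MB/s with a 2 microsecond start-up cost.

```python
def transfer_time(n_bytes, startup_s, peak_bw):
    """Linear model: T = T0 + n / B (seconds)."""
    return startup_s + n_bytes / peak_bw


def effective_bandwidth(n_bytes, startup_s, peak_bw):
    """Achieved bandwidth for a transfer of n bytes (bytes per second)."""
    return n_bytes / transfer_time(n_bytes, startup_s, peak_bw)


def half_power_point(startup_s, peak_bw):
    """Transfer size at which half of the peak bandwidth is achieved: T0 x B."""
    return startup_s * peak_bw


T0 = 2e-6   # start-up cost: 2 microseconds
B = 20e6    # peak bandwidth: 20 MB/s, assuming 1 MB = 10**6 bytes

n_half = half_power_point(T0, B)
print("half-power point (bytes):", n_half)
print("bandwidth at half-power point:", effective_bandwidth(n_half, T0, B))
print("time for a 40 MB image (s):", transfer_time(40e6, T0, B))
```

At the half-power point the start-up cost and the data transfer time are equal, which is why the achieved bandwidth is exactly half of B there.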

We now turn to the first part of the total transfer time, the start-up cost. <tt>'''T0'''</tt> is constant for a particular data transfer, but it varies across different kinds of transfer. ''For example'', in a memory operation the start-up cost is the memory access time. In message passing it can be estimated as the time taken by the first bit to reach the destination. For pipelined operations it is the time taken to fill the pipeline. For bus transactions it is the time spent in the arbitration and command phases.

As parallel computing has advanced, one major focus has been to ''reduce the start-up cost''. There are many ways to do so; we describe a few of them here. As stated earlier, the start-up cost for memory operations is essentially the memory access time. To reduce it, architects introduced a costly (and hence small) but fast storage area called a [http://en.wikipedia.org/wiki/Cache cache]. By exploiting [http://en.wikipedia.org/wiki/Memory_locality spatial and temporal locality], the cache is filled with useful items, so the processor does not have to go to memory (long [http://en.wikipedia.org/wiki/Memory_latency latency]) every time it needs data.
The average access time is governed by the following formula [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]:

<tt><center> '''Average memory access time = Hit time for the cache + (Miss Rate X Miss Penalty)'''</center></tt>

Architects therefore try to reduce all three components by adopting [http://citeseer.ist.psu.edu/kowarschik03overview.html different cache optimizations] such as multilevel caches, larger block sizes and higher associativity.
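To see how the three components of the average memory access time interact, the formula can be evaluated directly. This is only an illustrative sketch: the function name and the numbers (1 ns hit time, 5% miss rate, 100 ns miss penalty) are hypothetical values chosen for the example, not measurements from the cited text.

```python
def average_memory_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time = hit time + miss rate x miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be a fraction between 0 and 1")
    return hit_time_ns + miss_rate * miss_penalty_ns


# Hypothetical values, for illustration only.
baseline = average_memory_access_time(1.0, 0.05, 100.0)
# Halving the miss rate while keeping the other two components fixed.
improved = average_memory_access_time(1.0, 0.025, 100.0)
print("baseline (ns):", baseline)
print("with half the miss rate (ns):", improved)
```

Because the miss penalty is usually much larger than the hit time, even a small reduction in miss rate has a visible effect on the average.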

To see how beneficial a cache can be when the hit rate is reasonably high, compare typical access times for cache and main memory.
Cache access time is typically 0.5-25 ns, while main memory access time is 50-250 ns, so a memory hierarchy can decrease the start-up cost considerably. Cache bandwidth is around 5,000-20,000 MB/sec, whereas memory bandwidth is lower, at 2,500-10,000 MB/sec. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix C]

Similarly, for ''bus transactions'' the start-up cost is the time spent in the arbitration and command phases. Suppose that on a 3.6 GHz Intel Xeon machine (year 2004) it takes 3 cycles to arbitrate for the bus and present the address; the start-up cost is then around 0.83 nanoseconds. Assuming that the same 3 cycles were needed on a 0.3 GHz Alpha 21064A processor around 1994-95 (about 10 nanoseconds), the start-up cost has been reduced by more than 10 times. [http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Chapter 1]
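The arithmetic behind the bus example can be written out as a short sketch. The helper name is our own; the 3-cycle arbitration figure is the assumption stated in the text above.

```python
def startup_cost_ns(cycles, clock_ghz):
    """Time for a given number of clock cycles, in nanoseconds.

    One cycle at f GHz lasts 1/f ns, so the cost is cycles / f.
    """
    return cycles / clock_ghz


xeon_ns = startup_cost_ns(3, 3.6)    # 3.6 GHz Intel Xeon (2004)
alpha_ns = startup_cost_ns(3, 0.3)   # 0.3 GHz Alpha 21064A (1994-95)
print("Xeon start-up cost (ns):", round(xeon_ns, 2))
print("Alpha start-up cost (ns):", round(alpha_ns, 2))
print("improvement factor:", round(alpha_ns / xeon_ns, 1))
```

With the cycle count held fixed, the improvement factor is simply the ratio of the two clock frequencies.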

''Pipelining'' is another way to reduce data transfer time. For pipelined systems, the start-up cost is the time needed to fill the pipeline. Although introducing a pipeline appears to add start-up cost, it allows multiple operations to proceed concurrently, which helps achieve higher bandwidth.[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture: A Quantitative Approach, Appendix A]

Calculating the start-up cost is often challenging. As technology improves, the focus is to reduce the start-up cost and increase bandwidth.

The [http://pg.ece.ncsu.edu/mediawiki/images/3/37/P03.jpg next plot] (on a log-log scale) shows the time for a message passing operation on several machines as a function of message size [http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 source]. The start-up costs of the different machines vary widely (spread over nearly an order of magnitude). For small amounts of data the total transfer time is a nonlinear function of message size, contrary to the linear data transfer model; for large transfers the relationship is almost linear. The bandwidth can be calculated from the slope of the line.

[[Image:P03.jpg | Time for message passing operation versus message size ]]

Quoting other figures:

The '''iPSC/2''' machine has a start-up cost of 700 microseconds and the '''CRAY T3D (PVM)''' machine has a start-up cost of 21 microseconds, so within one decade the start-up cost dropped by more than an order of magnitude. The '''NOW''' machine has a start-up cost of just 16 microseconds. These improvements are essentially due to improvements in cycle time.

Similar trends can be observed in the maximum bandwidth achievable on each machine for data transfer. For the '''nCUBE/2''' machine the maximum bandwidth was 2 MB/s, whereas relatively advanced machines like the '''SP2''' have a bandwidth of 40 MB/s.

: However, this data transfer model also has a few shortcomings.
* It does not indicate when the next operation can be performed. Estimating the interval between two operations is particularly useful because bandwidth depends on how frequently operations can be initiated.
* It does not tell whether other useful work can be done during the transfer.
* It is easy to understand but not very suitable for architectural evaluation. For network transactions, the total message time is difficult to measure without a global clock, because the send and the receive usually happen on different processors. Transaction time is therefore usually measured with an echo test (one processor sends the data and waits until it receives a message back). This is reliable only if the receive is posted before the message arrives, so measuring transaction time is challenging (and not always accurate) under this model.

::In this section we discussed different aspects of the data transfer model. In a parallel computing environment, data transfers usually take place across the network and are invoked by the processor through the communication assist. We therefore need to look next at how communication costs are estimated and which factors are important to consider.

==== Overhead and Occupancy ====

''Communication time'' is one of the three components of processor execution time, alongside ''computation time'' and ''idle time''. It is the time the processor spends exchanging messages with another process(or). Communication can be of two types, interprocessor and intraprocessor. In interprocessor communication the two communicating tasks are handled by two different processors, while in intraprocessor communication they are handled by the same processor. Generally, intraprocessor and interprocessor communication cost the same, provided the former is not highly optimized.

Communication time is a function of the number of bytes (n) transferred. It is given by:

<tt><center> '''T(n) = Overhead + Occupancy + Network Delay + Message Size / Bandwidth + Delay due to Contention'''</center></tt>

[[Image:time.jpg]]

Source [http://www.cs.berkeley.edu/~culler/cs258-s99/ Lecture :Culler]
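The expression for T(n) above can be sketched as a small cost model. The class, field names and parameter values below are hypothetical and only illustrate how the five terms add up; they do not describe any particular machine.

```python
from dataclasses import dataclass


@dataclass
class LinkParams:
    """Hypothetical per-message cost parameters (seconds, except bandwidth)."""
    overhead: float
    occupancy: float
    network_delay: float
    bandwidth: float          # bytes per second
    contention: float = 0.0   # extra delay caused by contention


def communication_time(n_bytes, p):
    """T(n) = overhead + occupancy + network delay + n / bandwidth + contention."""
    return (p.overhead + p.occupancy + p.network_delay
            + n_bytes / p.bandwidth + p.contention)


# Illustrative values only.
link = LinkParams(overhead=1e-6, occupancy=0.5e-6,
                  network_delay=2e-6, bandwidth=100e6)
for size in (64, 4096, 1_000_000):
    print(size, "bytes:", communication_time(size, link), "s")
```

For small messages the fixed terms dominate, while for large messages the size/bandwidth term dominates, which mirrors the behaviour of the linear data transfer model.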

Communication overhead includes the time spent on
*Creating messages
*Executing communication protocols
*Physically sending messages
*Running through the protocol sets and decoding the message on the receiving node.

During this period the processor cannot do any useful computational work. Parallel programs running on different processors need to coordinate their work. This increases the rate of interprocessor communication, which in turn increases the net overhead cost.

Occupancy is the time spent at the slowest component in the communication assist, and it affects performance in a couple of ways. It delays the current request and indirectly contributes to the delay of subsequent requests. Occupancy also sets the upper limit on how frequently communication operations can be initiated by the processor.

''Some recent designs have helped reduce these communication costs. [http://en.wikipedia.org/wiki/Blue_Gene IBM Blue Gene (L)] uses a [http://www-03.ibm.com/industries/education/doc/content/bin/WhitepaperIBMBlueGenev7.0.pdf Global Collective Network], which carries out operations within the network itself. This saves the processors the time needed to decode messages with intermediate values, calculate new intermediate values, create new messages, and send them on to other nodes. The remaining overhead is primarily due to the communication protocol. Blue Gene/L also has a dedicated communications network, the [http://www-03.ibm.com/industries/education/doc/content/bin/WhitepaperIBMBlueGenev7.0.pdf Global Barrier and Interrupt Network], to speed up task-to-task coordination. Finally, it employs a [http://www-03.ibm.com/industries/education/doc/content/bin/WhitepaperIBMBlueGenev7.0.pdf Torus Network], in which the path length grows linearly while the number of nodes (processors) scales as a cube. The torus also allows messages to be sent in either direction, as in a ring network, and hence halves the distance between the furthest points. This in turn reduces network delays.''

[[Image:torus1.jpg]][[Image:global.jpg]][[Image:giga.jpg]]

(a) Three-dimensional Torus. (b) Global Collective Network. (c) IBM BG/L Control Network & Gigabit Ethernet networks

Source [http://www.research.ibm.com/journal/rd/492/gara.pdf IBM BG/L]

''IBM Blue Gene employs a simultaneous send/receive technique in the torus network. In a torus with N dimensions, a single node can send to and receive from its 2N neighboring nodes simultaneously along its 2N links.''

[[Image:torus.jpg]]

''If the cost of a single send is Ts (without simultaneous send), then for ‘S’ simultaneous sends the total cost becomes''
<center>Ts + Ts x (S – 1) x f where f: 0 < f < 1 </center>
''The speedup is approximately 1/f. Below is a performance comparison of an algorithm (used for image processing in a microscopy application) running on Blue Gene/L versus an Intel Linux cluster that does not employ simultaneous send.''

[[Image:bll.jpg]]
*source: [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1924514 BMC Cell Biology]

''For the Linux cluster the computation time is a very high percentage (up to 40%) of the total time. On Blue Gene it is as little as 5% of the total runtime.''
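The simultaneous-send cost expression above explains why the speedup approaches 1/f. The sketch below assumes, as a baseline that the text does not state explicitly, that S sends issued one after another cost S x Ts; the function names and the sample value of f are our own.

```python
def simultaneous_send_cost(ts, s, f):
    """Cost of S simultaneous sends: Ts + Ts x (S - 1) x f, with 0 < f < 1."""
    if s < 1:
        raise ValueError("s must be at least 1")
    if not 0.0 < f < 1.0:
        raise ValueError("f must satisfy 0 < f < 1")
    return ts + ts * (s - 1) * f


def sequential_send_cost(ts, s):
    """Assumed baseline: S sends issued one after another cost S x Ts."""
    return ts * s


def send_speedup(s, f):
    """Speedup of simultaneous over sequential sends; independent of Ts."""
    return sequential_send_cost(1.0, s) / simultaneous_send_cost(1.0, s, f)


f = 0.2  # hypothetical value of f
for s in (1, 6, 100, 10_000):
    print(s, "sends: speedup", round(send_speedup(s, f), 3))
print("limit 1/f:", 1 / f)
```

The ratio is S / (1 + (S – 1) x f), which tends to 1/f as S grows; for the six links of a three-dimensional torus the speedup is therefore still below that limit.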

From the processor’s point of view, a number of other network delays can be categorized as occupancy. Contention for resources can be viewed as one such occupancy, and it reduces the net bandwidth: if P processors concurrently use a network of bandwidth B, the effective bandwidth is B/P. Contention is of two basic types. When it occurs at routers and switches it is called network contention. When it occurs at endpoints or processing nodes it is called endpoint contention. The processing nodes involved in endpoint contention are called hot spots. This type of contention can usually be alleviated in software.

''The global interrupt and barrier network and the global tree network operate in parallel and provide global asynchronous sideband signals. This results in a low round-trip latency, as low as 1.3 microseconds. Network contention can always increase the latency. IBM BG/L uses the virtual cut-through (VCT) routing technique.''

==== Communication Cost ====

Ultimately we want to reduce the communication cost, which is given by the following equation:

<tt><center>'''Communication cost = Frequency of communication x (Communication Time – Overlap)'''</center></tt>

The frequency of communication depends on the machine architecture and the program design. Architectures such as scale-up symmetric multiprocessing (SMP) and scale-out massively parallel processing (MPP) [http://www.microsoft.com/technet/solutionaccelerators/cits/interopmigration/unix/hpcunxwn/ch01hpc.mspx (Microsoft Solution: HPC)] systems support tightly coupled parallel applications. Such applications communicate frequently, so the other components of communication time, such as overhead and network delay, must be kept small. Loosely coupled parallel applications running on an inherently parallel system architecture require minimal interprocess communication.

The overlap is the portion of a communication operation that is performed concurrently with other useful work by the processor (computation or other communication). Overlap is exploited to obtain high throughput. For instance, each node of IBM BG/L contains an IBM CMOS ASIC with two independent cores (microprocessors). The two cores are virtually identical: each can handle its own communication, or one can be dedicated to communication and the other to computation. In this way a very high degree of overlap is achieved.
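The effect of overlap on the communication cost equation can be illustrated with a short sketch. The function name and the numbers (1000 messages, 5 microseconds per message) are hypothetical and chosen only to show how overlap hides part of the communication time.

```python
def communication_cost(frequency, comm_time, overlap):
    """Communication cost = frequency x (communication time - overlap).

    frequency: number of communication operations
    comm_time: time per operation (seconds)
    overlap:   part of that time hidden behind other useful work (seconds)
    """
    if not 0.0 <= overlap <= comm_time:
        raise ValueError("overlap must lie between 0 and comm_time")
    return frequency * (comm_time - overlap)


# Hypothetical workload: 1000 messages of 5 microseconds each.
no_overlap = communication_cost(1000, 5e-6, 0.0)
mostly_hidden = communication_cost(1000, 5e-6, 4e-6)
print("cost without overlap (s):", no_overlap)
print("cost with 4 us of overlap per message (s):", mostly_hidden)
```

The cost falls either by communicating less often or by hiding more of each operation behind computation, which is what dedicating one core to communication aims for.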

=== Scalability ===

Scalability is such an important performance metric for a parallel computer that it deserves its own heading. A common perception is that performance keeps increasing as processors are added arbitrarily. This is not strictly true. A parallel system is scalable if there exists an isoefficiency function for it, that is, if the efficiency can be kept constant by increasing the problem size as processors are added. Scalability is bounded by two limits. In weak scaling, the load on each processor stays the same while the number of processors is increased. In strong scaling, the problem size stays the same, so the load on each processor shrinks as the processor count grows. Most problems lie between these two limits.

Is there a limit to the useful number of processors? Amdahl's law describes how performance is affected by increasing the number of processors. If the problem size is fixed and execution takes time T on a uniprocessor system, then on a parallel system with P processors it takes
<center>T x q + (1 – q) x T / P</center>

where q is the sequential fraction of the program, so that T x q is the time taken by the sequential part. The speedup is then
<center>S = 1 / (q + (1 – q)/P)</center>
Evaluating this expression gives the following result:

[[Image:amdhals.jpg]]

Source: Lecture Notes: [http://hal.iwr.uni-heidelberg.de/lehre/pcc-05/lecture5.pdf Course on Parallel Computing by Peter Bastian]

It can be concluded from the graph that not all algorithms produce high speedups; the achievable speedup depends on what is being run. Scalability is thus application dependent, and the application area needs to be specified when scaling up a system.
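The Amdahl's law expressions above are easy to tabulate. The following is a minimal sketch with function names of our own choosing; the sequential fractions used in the loop are arbitrary illustrative values.

```python
def parallel_time(t, q, p):
    """Execution time on P processors: T x q + (1 - q) x T / P."""
    return t * q + (1 - q) * t / p


def amdahl_speedup(q, p):
    """Speedup S = 1 / (q + (1 - q) / P), where q is the sequential fraction."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a fraction between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (q + (1 - q) / p)


# Speedup for a few arbitrary sequential fractions and processor counts.
for q in (0.01, 0.1, 0.5):
    row = [round(amdahl_speedup(q, p), 2) for p in (1, 10, 100, 1000)]
    print("q =", q, "->", row, "| upper bound 1/q =", 1 / q)
```

However many processors are added, the speedup never exceeds 1/q, so even a small sequential fraction puts a hard ceiling on scalability for a fixed problem size.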

== Bibliography ==
1.[http://www.amazon.com/Parallel-Computer-Architecture-Hardware-Software/dp/1558603433 Parallel Computer Architecture: A Hardware/Software Approach by David Culler and J.P. Singh with Anoop Gupta ]

2.[http://www.amazon.com/Computer-Architecture-Fourth-Quantitative-Approach/dp/0123704901 Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy , David A. Patterson]

3.[http://www.cs.princeton.edu/~jps/ Parallel Computer Architecture Lecture notes By Jaswinder Pal Singh]

4.[http://www.microsoft.com/technet/solutionaccelerators/cits/interopmigration/unix/hpcunxwn/ch01hpc.mspx Microsoft Solution: High Performance Computing]

5.Lecture Notes: [http://hal.iwr.uni-heidelberg.de/lehre/pcc-05/lecture5.pdf Course on Parallel Computing by Peter Bastian]

6.[http://www.research.ibm.com/journal/rd/492/gara.html IBM Blue Gene/L]

[[Image: test.sxd ]]