CSC/ECE 506 Fall 2007/wiki3 1 ncdt: Difference between revisions

From Expertiza_Wiki
Jump to navigation Jump to search
Line 25: Line 25:
Four optimizations of the data layout have been suggested in [4] to reduce false sharing cache misses in a multiprocessor environment:  
Four optimizations of the data layout have been suggested in [4] to reduce false sharing cache misses in a multiprocessor environment:  


[a] SplitScalar: Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.
[a] ''SplitScalar'': Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.


[b] HeapAllocate: Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing
[b] ''HeapAllocate'': Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing


[c] Expand Record: Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, do not apply the optimization.  
[c] ''Expand Record'': Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, do not apply the optimization.  


[d] Align Record: Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and in the cache block have a greater common divisor (GCD) larger than 1.The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.
[d] ''Align Record'': Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and in the cache block have a greater common divisor (GCD) larger than 1.The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.


'''2. Dynamic Cache sub-block Design to Reduce False Sharing [5]:''' In this method false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) which are truly shared, whereas larger blocks are used as the basic units of transfer.  
'''2. Dynamic Cache sub-block Design to Reduce False Sharing [5]:''' In this method false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) which are truly shared, whereas larger blocks are used as the basic units of transfer.  

Revision as of 03:29, 20 October 2007

Introduction

In a multiprocessor using a snoopy coherence protocol, the overall performance is a combination of the behavior of uniprocessor cache miss traffic and the traffic caused by the communication, which results in invalidations and subsequent cache misses. The uniprocessor miss rate is classified into the three C’s (capacity, compulsory, and conflict [1]). Similarly, the misses that arise from interprocessor communication are called coherence misses and can be divided into two different sources:

1. True sharing misses: Theses misses are caused by the communication of data through the cache coherence mechanism. For example, in an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block. Subsequently, when another processor attempts to read a modified word in that cache block, a miss occurs and the resultant block is transferred.

2. False sharing misses: These misses arise from the use of an invalidation-based coherence protocol with a single valid bit per cache block. False sharing occurs when a block is invalidated and a subsequent reference causes a miss because some word in the block, other than one being read, is written into.

The effect of coherence misses become significant for tightly coupled applications that share significant amount of data for example commercial workloads.

Effect of false sharing

If the word (in a block) modified is actually used by the processor that received the invalidation, then the reference was a true sharing reference and would have caused a miss independent of the block size. If, however, the word being written and the word read are different and the invalidation does not cause a new value to be communicated, but only causes an extra cache miss, then it is a false sharing miss. It is evident if the cache block size is increased, false sharing misses would also increase. Though, large block sizes exploit locality and decrease the effective memory access time, it also has a tendency to group data together even though only a part of it is needed by any one processor. In general, past studies have shown the miss rate of the data cache in multiprocessors changes less predictably than in uniprocessors with increasing cache block size [2]. Following figures depicts the impact of false sharing for different parallel application using 32 processors:

Figure 1: The number of false sharing misses and true sharing misses that happen during executions of each parallel application (a) Barnes and (b) Cholesky [3].

Not only do false misses increase the latency of memory accesses, they also generate traffic between processors and memory. As the block size increases, a miss produces a higher volume of traffic. The traffic increase with larger blocks occurs because many of the words transferred are not used. Between two consecutive misses on a given block, a processor usually references a very small number of distinct words in that block.

Techniques to reduce false sharing misses

There have been various techniques proposed both at hardware and software level to reduce false sharing misses. Following are some of the popular and accepted techniques to combat false sharing issue:

1. Data placement optimizations: This method addresses the problem of reducing false sharing misses on shared data by enhancing the spatial locality of shared data. The placement of data structures in cache blocks is optimized by using local changes that are programmer-transparent and have general applicability. This approach is partly motivated by the fact that cache misses on shared data are often concentrated in small sections of the shared data address space. Therefore, local actions involving relatively few bytes may yield most of the desired effects. Four optimizations of the data layout have been suggested in [4] to reduce false sharing cache misses in a multiprocessor environment:

[a] SplitScalar: Place scalar variables that cause false sharing in different blocks. Given a cache block with scalar variables where the increase in misses due to prefetching exceeds 0.5% of the program misses, allocate each of them to an empty cache block.

[b] HeapAllocate: Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing

[c] Expand Record: Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block. If the multi-word simulation indicates that there is much false sharing and little gain in prefetching, then consider expansion. If the reverse is true, do not apply the optimization.

[d] Align Record: Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing. This optimization is possible when the number of words in the record and in the cache block have a greater common divisor (GCD) larger than 1.The array is laid out at a distance from a block boundary equal to 0 or a multiple of the GCD, whichever wastes less space.

2. Dynamic Cache sub-block Design to Reduce False Sharing [5]: In this method false sharing is minimized by trying to dynamically locate the point of false reference. Sharing traffic is minimized by maintaining coherence on smaller blocks (sub-blocks) which are truly shared, whereas larger blocks are used as the basic units of transfer. The sub-block sizes are dynamically determined with the objective of maximizing the size of a truly shared sub-block which will settle to a value which is most suitable for the application being run and also minimize the invalidation misses due to false partitioning of true shared blocks. Larger blocks exploit locality while coherence is maintained on sub-blocks which minimize bus traffic due to shared misses. Figure 2 shows the cache organization to support dynamic cache sub-block design.

      Figure 2: New Architecture for the Dynamic Sub-block Coherence Scheme

The simulation results indicate that the dynamic sub-block protocol reduces the false sharing misses by 20 to 90 percent over the fixed sub-block scheme (Figure 3).

Figure 3: The percentage reduction in false sharing miss rate in the DSB scheme compared to the FSB scheme for MP3D, Floyd, Water and Jacobi vs block size.

Techniques to reduce true sharing misses

True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing, programs with high temporal locality tend to have less true sharing. The amount of true sharing in the program is a critical factor for performance on multiprocessors. High levels of true sharing and synchronization can easily overwhelm the advantage of parallelism. In fact true misses are required for the correct execution of the program. The techniques used for true sharing us majoring involved with reducing the latency and bus traffic because of the miss. One of the proposed technique is to have a private L1 cache and shared L2 cache among all the processors [7].


References

[1] Computer Architecture, Fourth Edition: A Quantitative Approach by John L. Hennessy , David A. Patterson [2] S. J. Eggers and R. H. Katz, “The effect of sharing on the cache and bus performance of parallel programs,” in Proc. 3rd Int. Conf Architectural Support for Programming Lung. and Operating Syst., Apr. 1989, pp.257-270.

[3] J. Lee and Y Cho, “An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors”, in Algorithms and Architectures for Parallel Processing, 1996.

[4] Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates. In Proceedings of the 1990 International Conference on Parallel Processang, volume II(Software), pages 266-270.

[5] Murali Kadiyala and Laxmi N. Bhuyan, “A Dynamic Cache sub-block Design to Reduce False Sharing” IEEE International Conference on Computer Design, 1995.

[6] http://suif.stanford.edu/papers/anderson95/node2.html

[7] Xuemei Zhao, Karl Sammut, Fangpo He, Shaowen Qin, Split Private and Shared L2 Cache Architecture for Snooping-based CMP