<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Kperi</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Kperi"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Kperi"/>
	<updated>2026-06-06T20:23:19Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=Talk:CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7596</id>
		<title>Talk:CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Talk:CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7596"/>
		<updated>2007-10-25T00:30:52Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Dear Reviewer,&lt;br /&gt;
                We have added more information on false sharing and true sharing to the introduction.&lt;br /&gt;
                The grammatical mistakes have also been corrected.&lt;br /&gt;
                Thank you for your feedback.&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=Talk:CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7595</id>
		<title>Talk:CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Talk:CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7595"/>
		<updated>2007-10-25T00:30:39Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Dear Reviewer,&lt;br /&gt;
                We have added more information on false sharing and true sharing to the introduction.&lt;br /&gt;
 The grammatical mistakes have also been corrected.&lt;br /&gt;
 Thank you for your feedback.&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=Talk:CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7594</id>
		<title>Talk:CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Talk:CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7594"/>
		<updated>2007-10-25T00:30:20Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Dear Reviewer,&lt;br /&gt;
                We have added more information on false sharing and true sharing to the introduction. The grammatical mistakes have also been corrected. Thank you for your feedback.&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7593</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7593"/>
		<updated>2007-10-25T00:28:19Z</updated>

		<summary type="html">&lt;p&gt;Kperi: /* ''Introduction'' */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Topic for Discussion'' ==&lt;br /&gt;
&lt;br /&gt;
True and false sharing. In Lectures 9 and 10, we covered performance results for true- and false-sharing misses. The results showed that some applications experienced degradation due to false sharing, and that this problem was greater with larger cache lines. But these data are at least 9 years old, and for multiprocessors that are smaller than those in use today. Comb the ACM Digital Library, IEEE Xplore, and the Web for more up-to-date results. What strategies have proven successful in combating false sharing? Is there any research into ways of diminishing true-sharing misses, e.g., by locating communicating processes on the same processor? Wouldn't this diminish parallelism and thus hurt performance?&lt;br /&gt;
&lt;br /&gt;
== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate tends to increase as well. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “three Cs”: compulsory, capacity, and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by one processor is used by another processor, the access is called true sharing. True sharing is intrinsic to a particular memory reference stream of a program and does not depend on the block size. True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors; high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': False sharing occurs when two or more processors access different data elements in the same coherence unit (cache line, memory page, etc.) and at least one of them writes. A severe form of false sharing occurs when an array dimension that exhibits spatial reuse is accessed by multiple writers, i.e., multiple processors that write to data in the same coherence unit. For example, this form of false sharing might occur when each processor updates a different row of a two-dimensional array stored in column-major order.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size can reduce the miss rate by exploiting spatial locality, since each miss brings more neighboring words into the cache. However, long cache lines may cause false sharing, when different processors access different words in the same cache line. In essence, the processors share the line without truly sharing the data they access.&lt;br /&gt;
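&lt;br /&gt;
The column-major example above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the array starts at byte address 0, processor p writes every element of row p, and a line counts as falsely shared as soon as more than one processor writes to it. The function names and parameters are hypothetical and are not taken from the cited papers.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


def writers_per_line(rows, cols, elem_size, line_size):
    """Map each cache line index to the set of processors that write to it.

    Assumption: a rows x cols array is stored in column-major order from
    byte address 0, and processor p writes every element of row p.
    """
    line_writers = defaultdict(set)
    for p in range(rows):
        for c in range(cols):
            address = (c * rows + p) * elem_size  # column-major byte offset
            line_writers[address // line_size].add(p)
    return line_writers


def falsely_shared_lines(rows, cols, elem_size, line_size):
    """Count the cache lines written by more than one processor."""
    lines = writers_per_line(rows, cols, elem_size, line_size)
    return sum(1 for writers in lines.values() if len(writers) != 1)


if __name__ == "__main__":
    # 4 processors, 8 columns, 8-byte elements, three line sizes.
    for line_size in (8, 16, 64):
        print(line_size, falsely_shared_lines(4, 8, 8, line_size))
```
&lt;br /&gt;
In this model, a line that holds exactly one element is never shared, while any longer line holds elements from several rows and is therefore written by several processors.&lt;br /&gt;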
&lt;br /&gt;
== ''Problem with False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location, or to locations near it, can usually be satisfied from the cache until the system must restore coherence between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the line as invalid in the other processors' caches. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element they access has not been modified. This is because cache coherence is maintained on a cache-line basis, not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. This situation is called false sharing, and it can become a bottleneck for performance and scalability.&lt;br /&gt;
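&lt;br /&gt;
The invalidation behaviour described above can be illustrated with a toy write-invalidate model in Python. This is a sketch under simplifying assumptions, not a model of any particular protocol: only writes are simulated, a write leaves the writer holding the single valid copy of the line, and first-time (cold) misses are not counted. All names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def count_coherence_misses(writes, line_size):
    """Count misses caused by another processor invalidating a cached line.

    writes is a list of (processor, byte_address) pairs in program order.
    """
    holders = {}     # line index -> processor holding the only valid copy
    touched = set()  # (processor, line) pairs that were cached at some point
    misses = 0
    for proc, address in writes:
        line = address // line_size
        if holders.get(line) != proc:
            if (proc, line) in touched:
                misses += 1  # the line was cached before and invalidated since
            touched.add((proc, line))
        holders[line] = proc  # the write invalidates every other copy
    return misses


if __name__ == "__main__":
    # Processor 0 writes the word at byte 0 and processor 1 the word at
    # byte 8, alternating four times. The two words are independent.
    writes = [(0, 0), (1, 8)] * 4
    print(count_coherence_misses(writes, 64))  # both words in one line
    print(count_coherence_misses(writes, 8))   # each word in its own line
```
&lt;br /&gt;
With both words in one line, every write after the first two misses, although neither processor ever touches the other's word; with one word per line, no such misses occur.&lt;br /&gt;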
&lt;br /&gt;
== ''Strategies to combat False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
Several strategies have been proposed, and are still being worked on, to decrease false sharing in multiprocessors. This article introduces a few of the salient ones.&lt;br /&gt;
&lt;br /&gt;
* Reducing False Sharing through Proper Block Sizing&lt;br /&gt;
* Reducing False Sharing through Sectored Caches&lt;br /&gt;
* Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks&lt;br /&gt;
* Reducing False Sharing through Compile Time Data Transformations&lt;br /&gt;
* Dynamic Cache Sub-Block Design to Reduce False Sharing&lt;br /&gt;
* False Sharing Elimination by Selection of Runtime Scheduling Parameters&lt;br /&gt;
&lt;br /&gt;
This article explores the first four techniques; links to papers on the last two are provided in the References section below.&lt;br /&gt;
&lt;br /&gt;
=== Proper Block Sizing ===&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to one fixed block size for the entire program execution. Starting with each word in the memory space that is used, neighboring blocks are combined if, when combined, they produce less bus traffic than when left as separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When excessive traffic is generated due to false sharing, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
=== Sectored Caches ===&lt;br /&gt;
&lt;br /&gt;
Sectored caches are another effective strategy for reducing false sharing misses in bus-based multiprocessors. A sectored cache is organized such that each line is divided into basic coherence units called subblocks. When false sharing occurs, the cache line involved need not be invalidated or transferred, as long as the corresponding subblocks are kept coherent. In simulations, this strategy reduced false sharing misses by about 30-80%. Details of the simulations are given in the paper linked below.&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/5048/ftp:zSzzSzpads10.cs.nthu.edu.twzSzpubzSzpaperszSzkcliuzSzicpads97.pdf/liu97effectiveness.pdf] Sectored Caches &lt;br /&gt;
&lt;br /&gt;
=== Optimizing the Layout of Shared Data in Cache Blocks ===&lt;br /&gt;
&lt;br /&gt;
A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, so the miss rate decreases. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It shows a significant increase in the miss rate as the block size increases, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
Several data placement optimizations have been suggested to reduce the miss rate due to false sharing. They are listed below.&lt;br /&gt;
&lt;br /&gt;
* '''SplitScalar:''' '''''Place scalar variables that cause false sharing in different blocks.'''''&lt;br /&gt;
&lt;br /&gt;
* '''Heap Allocate:''' '''''Allocate shared space from different heap regions according to which processor requests the space.''''' It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Expand Record:''' '''''Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records.''''' While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
* '''Align Record:''' '''''Choose a layout for arrays of records that minimizes the number of blocks the average record spans.''''' This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Lockscalar:''' '''''Place active scalars that are protected by a lock in the same block as the lock variable.''''' As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
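&lt;br /&gt;
As a rough illustration of the Expand Record idea listed above, the following Python sketch pads each record up to a whole number of cache blocks and counts how many blocks hold bytes from more than one record. It assumes the array starts at a block-aligned address 0; the function names and the sizes in the example are hypothetical, not values from the cited papers.&lt;br /&gt;
&lt;br /&gt;
```python
def padded_size(record_size, block_size):
    """Round record_size up to the next multiple of block_size."""
    return -(-record_size // block_size) * block_size


def blocks_shared_by_records(num_records, stride, record_size, block_size):
    """Count cache blocks that hold bytes from more than one record.

    Record r occupies bytes [r * stride, r * stride + record_size).
    """
    owners = {}
    for r in range(num_records):
        start = r * stride
        first = start // block_size
        last = (start + record_size - 1) // block_size
        for block in range(first, last + 1):
            owners.setdefault(block, set()).add(r)
    return sum(1 for records in owners.values() if len(records) != 1)


if __name__ == "__main__":
    record_size, block_size = 24, 64
    # Packed layout: records follow each other with no padding.
    print(blocks_shared_by_records(8, record_size, record_size, block_size))
    # Expanded layout: each record is padded to a full block.
    stride = padded_size(record_size, block_size)
    print(blocks_shared_by_records(8, stride, record_size, block_size))
```
&lt;br /&gt;
The padding trades memory for isolation: in the expanded layout no block is shared between records, but part of each block is occupied by dummy words.&lt;br /&gt;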
&lt;br /&gt;
=== Compile Time Data Transformations ===&lt;br /&gt;
&lt;br /&gt;
This section introduces compile-time data transformations for reducing false sharing in multiprocessors. The paper by Tor E. Jeremiassen and Susan J. Eggers, listed in the References, gives a detailed explanation of the algorithms.&lt;br /&gt;
&lt;br /&gt;
A series of compiler-directed algorithms and a suite of transformations can be employed to restructure the shared data at compile time. These algorithms analyze explicitly parallel programs, produce information about their cross-processor memory reference patterns, identify data structures susceptible to false sharing, and then choose appropriate transformations to eliminate it.&lt;br /&gt;
&lt;br /&gt;
The compiler analysis comprises the following three stages:&lt;br /&gt;
&lt;br /&gt;
* Determining the section of code that each process executes, by computing its control flow graph (per-process control flow analysis).&lt;br /&gt;
* Performing non-concurrency analysis, by examining the barrier synchronization pattern of the program, delineating the phases that cannot execute in parallel, and computing the flow of control between them.&lt;br /&gt;
* Performing a summary side-effect analysis on a per-process basis for each phase determined in stage two.&lt;br /&gt;
&lt;br /&gt;
The per-process control flow analysis and the summary side-effect analysis (stages 1 and 3, respectively) yield the sections of shared data that each processor reads and writes. The non-concurrency analysis (stage 2) uses synchronization points to determine which portions of the code can execute in parallel and which cannot.&lt;br /&gt;
&lt;br /&gt;
In order to reduce the number of false sharing misses, data must be restructured so that:&lt;br /&gt;
* Data that is only, or overwhelmingly, accessed by one processor is grouped together.&lt;br /&gt;
* Write-shared data objects with no processor locality do not share cache lines.&lt;br /&gt;
&lt;br /&gt;
Two transformations have been devised to achieve the above two conditions.&lt;br /&gt;
* '''group and transpose''' – To address condition 1.&lt;br /&gt;
* '''padding''' – To address condition 2.&lt;br /&gt;
&lt;br /&gt;
Group and transpose physically groups data together by changing the layout of the data structures in memory. If each processor’s data is smaller than the cache block size, it may be padded so that no two processors’ data share a cache block. In addition to avoiding false sharing, this also improves spatial locality.&lt;br /&gt;
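&lt;br /&gt;
The effect of group and transpose on a data layout can be sketched in Python. The example assumes a simple interleaved ownership pattern, in which element i of a flat array is accessed by processor i modulo the number of processors; that pattern and the function name are illustrative assumptions, not the algorithm from the paper, which derives ownership from compiler analysis.&lt;br /&gt;
&lt;br /&gt;
```python
def group_and_transpose(data, num_procs):
    """Return a layout in which each processor's elements are contiguous.

    Assumption: element i of data is accessed by processor i % num_procs.
    """
    return [
        data[i]
        for p in range(num_procs)
        for i in range(p, len(data), num_procs)
    ]


if __name__ == "__main__":
    interleaved = list(range(8))  # owners: 0, 1, 0, 1, 0, 1, 0, 1
    print(group_and_transpose(interleaved, 2))
```
&lt;br /&gt;
After the transformation, the elements of processor 0 come first and those of processor 1 follow, so each processor works on its own contiguous region.&lt;br /&gt;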
&lt;br /&gt;
The second transformation, padding, pads data that is falsely shared in the short term but is eventually write-shared by all processors over time.&lt;br /&gt;
&lt;br /&gt;
The speedups achieved with the compile time data transformations for a few test programs are given below.&lt;br /&gt;
                                  [[Image:Plots.jpg]]&lt;br /&gt;
&lt;br /&gt;
== ''Diminishing True Sharing Misses'' ==&lt;br /&gt;
&lt;br /&gt;
A new technique, called coherence decoupling, breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce, if not eliminate, the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. The simulations and results can be viewed at the link below.&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.cs.utexas.edu/pub/dburger/papers/ASPLOS04_cd.pdf]&lt;br /&gt;
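&lt;br /&gt;
The speculate-then-verify idea can be sketched in a few lines of Python. This is a sequential toy model, not the hardware mechanism from the paper: the real protocol overlaps the speculative computation with the coherence request, whereas here the two steps simply run one after the other. The function and parameter names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def load_with_decoupling(stale_value, fetch_coherent_value, compute):
    """Compute on a stale value, then verify it against the coherent one.

    Returns (result, speculation_succeeded).
    """
    # Speculative step: use the value found in the invalid cache line.
    speculative_result = compute(stale_value)
    # Backing protocol: obtain the coherent value.
    coherent_value = fetch_coherent_value()
    if coherent_value == stale_value:
        return speculative_result, True
    # Misspeculation: redo the computation with the correct value.
    return compute(coherent_value), False


if __name__ == "__main__":
    def add_one(value):
        return value + 1

    # False sharing: another word in the line changed, this word did not.
    print(load_with_decoupling(5, lambda: 5, add_one))
    # True sharing: the word itself was updated by another processor.
    print(load_with_decoupling(5, lambda: 7, add_one))
```
&lt;br /&gt;
In the false sharing case the stale word still holds the right value, so the speculative result stands; in the true sharing case the check fails and the work is repeated with the coherent value.&lt;br /&gt;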
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;br /&gt;
&lt;br /&gt;
[http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/1663/ftp:zSzzSzftp.cs.washington.eduzSztrzSz1994zSz09zSzUW-CSE-94-09-05.pdf/jeremiassen94reducing.pdf]&lt;br /&gt;
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.&lt;br /&gt;
&lt;br /&gt;
Tor E. Jeremiassen and Susan J. Eggers&lt;br /&gt;
&lt;br /&gt;
[https://www.it.uu.se/research/publications/reports/2003-044/2003-044-nc.pdf]&lt;br /&gt;
Cache Memory Behavior of Advanced PDE Solvers&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/26601/http:zSzzSzresearch.cs.tamu.eduzSzncstrlzSzTR95-010.pdf/kadiyala95dynamic.pdf]&lt;br /&gt;
A Dynamic Cache Sub-Block Design to Reduce False Sharing&lt;br /&gt;
&lt;br /&gt;
Murali Kadiyala and Laxmi N Bhuyan&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/573/http:zSzzSzwww.csg.lcs.mit.edu:8001zSzUserszSzvivekzSz.zSzpszSzChSa97.pdf/chow97false.pdf]&lt;br /&gt;
False Sharing Elimination by Selection of Runtime Scheduling Parameters&lt;br /&gt;
&lt;br /&gt;
Jyh-Herng Chow and Vivek Sarkar&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7589</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7589"/>
		<updated>2007-10-25T00:05:22Z</updated>

		<summary type="html">&lt;p&gt;Kperi: /* Compile Time Data Transformations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Topic for Discussion'' ==&lt;br /&gt;
&lt;br /&gt;
True and false sharing. In Lectures 9 and 10, we covered performance results for true- and false-sharing misses. The results showed that some applications experienced degradation due to false sharing, and that this problem was greater with larger cache lines. But these data are at least 9 years old, and for multiprocessors that are smaller than those in use today. Comb the ACM Digital Library, IEEE Xplore, and the Web for more up-to-date results. What strategies have proven successful in combating false sharing? Is there any research into ways of diminishing true-sharing misses, e.g., by locating communicating processes on the same processor? Wouldn't this diminish parallelism and thus hurt performance?&lt;br /&gt;
&lt;br /&gt;
== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate tends to increase as well. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “three Cs”: compulsory, capacity, and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by one processor is used by another processor, the access is called true sharing. True sharing is intrinsic to a particular memory reference stream of a program and does not depend on the block size. True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors; high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': False sharing occurs when independent data words used by different processors are placed in the same block.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size can reduce the miss rate by exploiting spatial locality, since each miss brings more neighboring words into the cache. However, long cache lines may cause false sharing, when different processors access different words in the same cache line. In essence, the processors share the line without truly sharing the data they access.&lt;br /&gt;
&lt;br /&gt;
== ''Problem with False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location, or to locations near it, can usually be satisfied from the cache until the system must restore coherence between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the line as invalid in the other processors' caches. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element they access has not been modified. This is because cache coherence is maintained on a cache-line basis, not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. This situation is called false sharing, and it can become a bottleneck for performance and scalability.&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
Several strategies have been proposed, and are still being worked on, to decrease false sharing in multiprocessors. This article introduces a few of the salient ones.&lt;br /&gt;
&lt;br /&gt;
* Reducing False Sharing through Proper Block Sizing&lt;br /&gt;
* Reducing False Sharing through Sectored Caches&lt;br /&gt;
* Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks&lt;br /&gt;
* Reducing False Sharing through Compile Time Data Transformations&lt;br /&gt;
* Dynamic Cache Sub-Block Design to Reduce False Sharing&lt;br /&gt;
* False Sharing Elimination by Selection of Runtime Scheduling Parameters&lt;br /&gt;
&lt;br /&gt;
This article explores the first four techniques; links to papers on the last two are provided in the References section below.&lt;br /&gt;
&lt;br /&gt;
=== Proper Block Sizing ===&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to one fixed block size for the entire program execution. Starting with each word in the memory space that is used, neighboring blocks are combined if, when combined, they produce less bus traffic than when left as separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When excessive traffic is generated due to false sharing, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
=== Sectored Caches ===&lt;br /&gt;
&lt;br /&gt;
Sectored caches are another effective strategy for reducing false sharing misses in bus-based multiprocessors. A sectored cache is organized such that each line is divided into basic coherence units called subblocks. When false sharing occurs, the cache line involved need not be invalidated or transferred, as long as the corresponding subblocks are kept coherent. In simulations, this strategy reduced false sharing misses by about 30-80%. Details of the simulations are given in the paper linked below.&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/5048/ftp:zSzzSzpads10.cs.nthu.edu.twzSzpubzSzpaperszSzkcliuzSzicpads97.pdf/liu97effectiveness.pdf] Sectored Caches &lt;br /&gt;
&lt;br /&gt;
=== Optimizing the Layout of Shared Data in Cache Blocks ===&lt;br /&gt;
&lt;br /&gt;
A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, so the miss rate decreases. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It shows a significant increase in the miss rate as the block size increases, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
Several data placement optimizations have been suggested to reduce the miss rate due to false sharing. They are listed below.&lt;br /&gt;
&lt;br /&gt;
* '''SplitScalar:''' '''''Place scalar variables that cause false sharing in different blocks.'''''&lt;br /&gt;
&lt;br /&gt;
* '''Heap Allocate:''' '''''Allocate shared space from different heap regions according to which processor requests the space.''''' It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Expand Record:''' '''''Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records.''''' While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
* '''Align Record:''' '''''Choose a layout for arrays of records that minimizes the number of blocks the average record spans.''''' This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Lockscalar:''' '''''Place active scalars that are protected by a lock in the same block as the lock variable.''''' As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
&lt;br /&gt;
=== Compile Time Data Transformations ===&lt;br /&gt;
&lt;br /&gt;
This section introduces compile-time data transformations for reducing false sharing in multiprocessors. The paper by Tor E. Jeremiassen and Susan J. Eggers, listed in the References, gives a detailed explanation of the algorithms.&lt;br /&gt;
&lt;br /&gt;
A series of compiler-directed algorithms and a suite of transformations can be employed to restructure the shared data at compile time. These algorithms analyze explicitly parallel programs, produce information about their cross-processor memory reference patterns, identify data structures susceptible to false sharing, and then choose appropriate transformations to eliminate it.&lt;br /&gt;
&lt;br /&gt;
The compiler analysis comprises the following three stages:&lt;br /&gt;
&lt;br /&gt;
* Determining the section of code that each process executes, by computing its control flow graph (per-process control flow analysis).&lt;br /&gt;
* Performing non-concurrency analysis, by examining the barrier synchronization pattern of the program, delineating the phases that cannot execute in parallel, and computing the flow of control between them.&lt;br /&gt;
* Performing a summary side-effect analysis on a per-process basis for each phase determined in stage two.&lt;br /&gt;
&lt;br /&gt;
The per-process control flow analysis and the summary side-effect analysis (stages 1 and 3, respectively) yield the sections of shared data that each processor reads and writes. The non-concurrency analysis (stage 2) uses synchronization points to determine which portions of the code can execute in parallel and which cannot.&lt;br /&gt;
&lt;br /&gt;
In order to reduce the number of false sharing misses, data must be restructured so that:&lt;br /&gt;
* Data that is only, or overwhelmingly, accessed by one processor is grouped together.&lt;br /&gt;
* Write-shared data objects with no processor locality do not share cache lines.&lt;br /&gt;
&lt;br /&gt;
Two transformations have been devised to achieve the above two conditions.&lt;br /&gt;
* '''group and transpose''' – To address condition 1.&lt;br /&gt;
* '''padding''' – To address condition 2.&lt;br /&gt;
&lt;br /&gt;
Group and transpose physically groups data together by changing the layout of the data structures in memory. If each processor’s data is smaller than the cache block size, it may be padded so that no two processors’ data share a cache block. In addition to avoiding false sharing, this also improves spatial locality.&lt;br /&gt;
&lt;br /&gt;
The second transformation, padding, pads data that is falsely shared in the short term but is eventually write-shared by all processors over time.&lt;br /&gt;
&lt;br /&gt;
The speedups achieved with the compile time data transformations for a few test programs are given below.&lt;br /&gt;
                                  [[Image:Plots.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Diminishing True sharing misses ==&lt;br /&gt;
&lt;br /&gt;
A new technique, called coherence decoupling, breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce, if not eliminate, the effects of false sharing. Furthermore, coherence decoupling can also reduce the latencies incurred by true sharing. The simulations and results can be viewed at the link below.&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.cs.utexas.edu/pub/dburger/papers/ASPLOS04_cd.pdf]&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
&lt;br /&gt;
Josep Torrellas, Monica S. Lam, and John L. Hennessy&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;br /&gt;
&lt;br /&gt;
[http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/1663/ftp:zSzzSzftp.cs.washington.eduzSztrzSz1994zSz09zSzUW-CSE-94-09-05.pdf/jeremiassen94reducing.pdf]&lt;br /&gt;
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.&lt;br /&gt;
&lt;br /&gt;
Tor E. Jeremiassen and Susan J. Eggers&lt;br /&gt;
&lt;br /&gt;
[https://www.it.uu.se/research/publications/reports/2003-044/2003-044-nc.pdf]&lt;br /&gt;
Cache Memory Behavior of Advanced PDE Solvers&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/26601/http:zSzzSzresearch.cs.tamu.eduzSzncstrlzSzTR95-010.pdf/kadiyala95dynamic.pdf]&lt;br /&gt;
A Dynamic Cache Sub-Block Design to Reduce False Sharing&lt;br /&gt;
&lt;br /&gt;
Murali Kadiyala and Laxmi N Bhuyan&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/573/http:zSzzSzwww.csg.lcs.mit.edu:8001zSzUserszSzvivekzSz.zSzpszSzChSa97.pdf/chow97false.pdf]&lt;br /&gt;
False Sharing Elimination by Selection of Runtime Scheduling Parameters&lt;br /&gt;
&lt;br /&gt;
Jyh-Herng Chow and Vivek Sarkar&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7588</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7588"/>
		<updated>2007-10-25T00:04:49Z</updated>

		<summary type="html">&lt;p&gt;Kperi: /* Compile Time Data Transformations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Topic for Discussion'' ==&lt;br /&gt;
&lt;br /&gt;
True and false sharing. In Lectures 9 and 10, we covered performance results for true- and false-sharing misses. The results showed that some applications experienced degradation due to false sharing, and that this problem was greater with larger cache lines. But these data are at least 9 years old, and for multiprocessors that are smaller than those in use today. Comb the ACM Digital Library, IEEE Xplore, and the Web for more up-to-date results. What strategies have proven successful in combating false sharing? Is there any research into ways of diminishing true-sharing misses, e.g., by locating communicating processes on the same processor? Wouldn't this diminish parallelism and thus hurt performance?&lt;br /&gt;
&lt;br /&gt;
== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit the performance of multiprocessors. Cache misses are broadly categorized into the “three Cs”: compulsory misses, capacity misses, and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': True sharing occurs when a data word produced by one processor is used by another processor. It is intrinsic to a particular memory reference stream of a program and does not depend on the block size. True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors; high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': False sharing occurs when independent data words used by different processors are placed in the same block.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate by exploiting spatial locality, as more words can be accommodated in the same line. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
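&lt;br /&gt;
The relationship between line size and false sharing can be illustrated with a small sketch. It assumes a hypothetical layout in which each processor owns one word and the words are packed contiguously from address 0; the processor count, word size, and line sizes are example values, not figures from the cited studies.&lt;br /&gt;

```python
def max_sharers(num_procs, word_size, line_size):
    """Return the largest number of processors whose words land in one cache line."""
    owners = {}
    for proc in range(num_procs):
        address = proc * word_size   # one word per processor, packed contiguously
        owners.setdefault(address // line_size, set()).add(proc)
    return max(len(procs) for procs in owners.values())


# Assumed configuration: 8 processors, 4-byte words, growing line sizes.
for line_size in (4, 16, 64):
    print(line_size, max_sharers(8, 4, line_size))
```

As the line grows, more processors end up with their independent words in the same line, which is the condition for false sharing.&lt;br /&gt;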
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to nearby locations can usually be satisfied from the cache until the system determines that it must restore coherency between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire cache line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the line as invalid in the other processors' caches. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element they access has not been modified. This is because cache coherency is maintained on a cache-line basis, and not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. This situation is called false sharing, and it can become a bottleneck for performance and scalability.&lt;br /&gt;
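&lt;br /&gt;
The invalidation behaviour described above can be sketched with a deliberately simplified write-invalidate model. The model tracks only writes, ignores reads and timing, and uses hypothetical variable names; it is an illustration of the effect, not a model of any particular protocol.&lt;br /&gt;

```python
def count_invalidations(writes, word_to_line):
    """Count invalidations under a simple write-invalidate model.

    writes is a list of (processor, word) pairs in program order.
    word_to_line maps each word to the cache line that holds it.
    """
    holders = {}   # cache line index mapped to the processors holding a valid copy
    invalidations = 0
    for proc, word in writes:
        line = word_to_line[word]
        current = holders.get(line, set())
        # Every other valid copy of the line is invalidated by this write.
        invalidations += len(current - {proc})
        holders[line] = {proc}
    return invalidations


# Two processors alternately update their own private counter, ten times each.
writes = [(p, "counter%d" % p) for _ in range(10) for p in (0, 1)]

same_line = {"counter0": 0, "counter1": 0}        # both counters in one line
separate_lines = {"counter0": 0, "counter1": 1}   # one counter per line

print(count_invalidations(writes, same_line))
print(count_invalidations(writes, separate_lines))
```

Although neither processor ever touches the other's counter, placing both counters in one line makes almost every write invalidate the other processor's copy.&lt;br /&gt;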
&lt;br /&gt;
== ''Strategies to combat False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
Several strategies have been proposed, and are still being worked on, to decrease false sharing in multiprocessors. This article introduces a few of the salient ones.&lt;br /&gt;
&lt;br /&gt;
* Reducing False Sharing through Proper Block Sizing&lt;br /&gt;
* Reducing False Sharing through Sectored Caches&lt;br /&gt;
* Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks&lt;br /&gt;
* Reducing False Sharing through Compile Time Data Transformations&lt;br /&gt;
* Dynamic Cache Sub-Block Design to Reduce False Sharing&lt;br /&gt;
* False Sharing Elimination by Selection of Runtime Scheduling Parameters&lt;br /&gt;
&lt;br /&gt;
This article explores some of the above strategies; links to papers on the last two techniques are provided in the references section below.&lt;br /&gt;
&lt;br /&gt;
=== Proper Block Sizing ===&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to a block of one fixed size for the entire program execution. Starting with each word in the memory space that is used, neighboring blocks are combined if the combined block produces less bus traffic than the separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand-fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When false sharing generates excessive traffic, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
=== Sectored Caches ===&lt;br /&gt;
&lt;br /&gt;
Sectored caches can be another effective strategy to reduce false sharing misses in bus-based multiprocessors. A sectored cache is organized such that each line is divided into basic coherence units, called subblocks. When false sharing occurs, the cache line involved need not be invalidated or transferred, as long as the corresponding subblocks are kept coherent. Simulations of this strategy show false sharing misses reduced by about 30-80%. Details of the simulations are given in the paper linked below.&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/5048/ftp:zSzzSzpads10.cs.nthu.edu.twzSzpubzSzpaperszSzkcliuzSzicpads97.pdf/liu97effectiveness.pdf] Sectored Caches &lt;br /&gt;
&lt;br /&gt;
=== Optimizing the Layout of Shared Data in Cache Blocks ===&lt;br /&gt;
&lt;br /&gt;
A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, which decreases the miss rate. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus, in multiprocessors the miss rate can go up or down as the block size increases. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It depicts a significant increase in the miss rate as the block size increases, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
Several data placement optimizations have been suggested to reduce the miss rate due to false sharing. They are listed below.&lt;br /&gt;
&lt;br /&gt;
* '''SplitScalar:''' '''''Place scalar variables that cause false sharing in different blocks.'''''&lt;br /&gt;
&lt;br /&gt;
* '''Heap Allocate:''' '''''Allocate shared space from different heap regions according to which processor requests the space.''''' It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Expand Record:''' '''''Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records.''''' While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
* '''Align Record:''' '''''Choose a layout for arrays of records that minimizes the number of blocks the average record spans.''''' This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Lockscalar:''' '''''Place active scalars that are protected by a lock in the same block as the lock variable.''''' As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
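&lt;br /&gt;
As a rough illustration of the Expand Record optimization listed above, the sketch below pads a record up to a multiple of the block size and compares which blocks neighboring records occupy. The 24-byte record, the 64-byte block, and the assumption that the array starts on a block boundary are all hypothetical.&lt;br /&gt;

```python
def padded_record_size(record_size, block_size):
    """Round a record size up to the next multiple of the cache block size."""
    blocks = -(-record_size // block_size)   # ceiling division
    return blocks * block_size


def blocks_touched(index, record_size, block_size):
    """Return the set of cache block indices occupied by record number index."""
    start = index * record_size
    last = start + record_size - 1
    return set(range(start // block_size, last // block_size + 1))


record_size, block_size = 24, 64   # assumed sizes in bytes
padded = padded_record_size(record_size, block_size)

# Blocks common to records 0 and 1, before and after padding.
print(blocks_touched(0, record_size, block_size).intersection(
    blocks_touched(1, record_size, block_size)))
print(blocks_touched(0, padded, block_size).intersection(
    blocks_touched(1, padded, block_size)))
```

The padding trades memory for isolation: each record grows from 24 to 64 bytes in this example, so the optimization is most attractive when the records are written by different processors.&lt;br /&gt;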
&lt;br /&gt;
=== Compile Time Data Transformations ===&lt;br /&gt;
&lt;br /&gt;
This section introduces the reader to the concepts of compile-time data transformations for reducing false sharing in multiprocessors. The paper by Tor E. Jeremiassen and Susan J. Eggers, listed in the references, gives a detailed explanation of the algorithms.&lt;br /&gt;
&lt;br /&gt;
A series of compiler-directed algorithms and a suite of transformations can be employed to restructure shared data at compile time. These algorithms analyze explicitly parallel programs, producing information about their cross-processor memory reference patterns that identifies data structures susceptible to false sharing, and then choose appropriate transformations to eliminate it.&lt;br /&gt;
&lt;br /&gt;
The compiler analysis comprises the following three stages:&lt;br /&gt;
&lt;br /&gt;
* Determining the section of code each process executes by computing its control flow graph (per-process control flow analysis).&lt;br /&gt;
* Performing non-concurrency analysis by examining the barrier synchronization pattern of the program, delineating the phases that cannot execute in parallel, and computing the flow of control between them.&lt;br /&gt;
* Performing a summary side-effect analysis on a per-process basis for each phase (determined in stage two).&lt;br /&gt;
&lt;br /&gt;
The per-process control flow analysis and summary side-effect analysis (stages 1 and 3, respectively) yield the sections of shared data that each processor reads and writes. The non-concurrency analysis (stage 2) uses synchronization points to determine which portions of the code can execute in parallel and which cannot.&lt;br /&gt;
&lt;br /&gt;
In order to reduce the number of false sharing misses, data must be restructured so that:&lt;br /&gt;
* Data that is only, or overwhelmingly, accessed by one processor is grouped together.&lt;br /&gt;
* Write-shared data objects with no processor locality do not share cache lines.&lt;br /&gt;
&lt;br /&gt;
Two transformations have been devised to achieve the above two conditions.&lt;br /&gt;
* '''group and transpose''' – To address condition 1.&lt;br /&gt;
* '''padding''' – To address condition 2.&lt;br /&gt;
&lt;br /&gt;
'''''Group and transpose''''' physically groups data together by changing the layout of the data structures in memory. If each processor’s data is smaller than the cache block size, it may be padded so that no two processors’ data share a cache block. In addition to avoiding false sharing, this also improves spatial locality.&lt;br /&gt;
&lt;br /&gt;
The second transformation, '''padding''', pads data that is falsely shared in the short term but eventually write-shared by all processors over time.&lt;br /&gt;
&lt;br /&gt;
The speedups achieved with the compile time data transformations for a few test programs are given below.&lt;br /&gt;
                                  [[Image:Plots.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Diminishing True sharing misses ==&lt;br /&gt;
&lt;br /&gt;
A new technique, called coherence decoupling, breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce, if not eliminate, the effects of false sharing. Furthermore, coherence decoupling can also reduce the latencies incurred by true sharing. The simulations and results can be viewed at the link below.&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.cs.utexas.edu/pub/dburger/papers/ASPLOS04_cd.pdf]&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
&lt;br /&gt;
Josep Torrellas, Monica S. Lam, and John L. Hennessy&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;br /&gt;
&lt;br /&gt;
[http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/1663/ftp:zSzzSzftp.cs.washington.eduzSztrzSz1994zSz09zSzUW-CSE-94-09-05.pdf/jeremiassen94reducing.pdf]&lt;br /&gt;
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.&lt;br /&gt;
&lt;br /&gt;
Tor E. Jeremiassen and Susan J. Eggers&lt;br /&gt;
&lt;br /&gt;
[https://www.it.uu.se/research/publications/reports/2003-044/2003-044-nc.pdf]&lt;br /&gt;
Cache Memory Behavior of Advanced PDE Solvers&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/26601/http:zSzzSzresearch.cs.tamu.eduzSzncstrlzSzTR95-010.pdf/kadiyala95dynamic.pdf]&lt;br /&gt;
A Dynamic Cache Sub-Block Design to Reduce False Sharing&lt;br /&gt;
&lt;br /&gt;
Murali Kadiyala and Laxmi N Bhuyan&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/573/http:zSzzSzwww.csg.lcs.mit.edu:8001zSzUserszSzvivekzSz.zSzpszSzChSa97.pdf/chow97false.pdf]&lt;br /&gt;
False Sharing Elimination by Selection of Runtime Scheduling Parameters&lt;br /&gt;
&lt;br /&gt;
Jyh-Herng Chow and Vivek Sarkar&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7584</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7584"/>
		<updated>2007-10-24T23:52:28Z</updated>

		<summary type="html">&lt;p&gt;Kperi: /* ''Strategies to combat False Sharing'' */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Topic for Discussion'' ==&lt;br /&gt;
&lt;br /&gt;
True and false sharing. In Lectures 9 and 10, we covered performance results for true- and false-sharing misses. The results showed that some applications experienced degradation due to false sharing, and that this problem was greater with larger cache lines. But these data are at least 9 years old, and for multiprocessors that are smaller than those in use today. Comb the ACM Digital Library, IEEE Xplore, and the Web for more up-to-date results. What strategies have proven successful in combating false sharing? Is there any research into ways of diminishing true-sharing misses, e.g., by locating communicating processes on the same processor? Wouldn't this diminish parallelism and thus hurt performance?&lt;br /&gt;
&lt;br /&gt;
== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit the performance of multiprocessors. Cache misses are broadly categorized into the “three Cs”: compulsory misses, capacity misses, and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': True sharing occurs when a data word produced by one processor is used by another processor. It is intrinsic to a particular memory reference stream of a program and does not depend on the block size. True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors; high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': False sharing occurs when independent data words used by different processors are placed in the same block.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate by exploiting spatial locality, as more words can be accommodated in the same line. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to nearby locations can usually be satisfied from the cache until the system determines that it must restore coherency between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire cache line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the line as invalid in the other processors' caches. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element they access has not been modified. This is because cache coherency is maintained on a cache-line basis, and not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. This situation is called false sharing, and it can become a bottleneck for performance and scalability.&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
Several strategies have been proposed, and are still being worked on, to decrease false sharing in multiprocessors. This article introduces a few of the salient ones.&lt;br /&gt;
&lt;br /&gt;
* Reducing False Sharing through Proper Block Sizing&lt;br /&gt;
* Reducing False Sharing through Sectored Caches&lt;br /&gt;
* Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks&lt;br /&gt;
* Reducing False Sharing through Compile Time Data Transformations&lt;br /&gt;
* Dynamic Cache Sub-Block Design to Reduce False Sharing&lt;br /&gt;
* False Sharing Elimination by Selection of Runtime Scheduling Parameters&lt;br /&gt;
&lt;br /&gt;
This article explores some of the above strategies; links to papers on the last two techniques are provided in the references section below.&lt;br /&gt;
&lt;br /&gt;
=== Proper Block Sizing ===&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to a block of one fixed size for the entire program execution. Starting with each word in the memory space that is used, neighboring blocks are combined if the combined block produces less bus traffic than the separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand-fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When false sharing generates excessive traffic, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
=== Sectored Caches ===&lt;br /&gt;
&lt;br /&gt;
Sectored caches can be another effective strategy to reduce false sharing misses in bus-based multiprocessors. A sectored cache is organized such that each line is divided into basic coherence units, called subblocks. When false sharing occurs, the cache line involved need not be invalidated or transferred, as long as the corresponding subblocks are kept coherent. Simulations of this strategy show false sharing misses reduced by about 30-80%. Details of the simulations are given in the paper linked below.&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/5048/ftp:zSzzSzpads10.cs.nthu.edu.twzSzpubzSzpaperszSzkcliuzSzicpads97.pdf/liu97effectiveness.pdf] Sectored Caches &lt;br /&gt;
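&lt;br /&gt;
The idea of keeping coherence state per subblock can be sketched as follows. This is not the protocol from the linked paper; it is a minimal model, assuming a 64-byte line split into 16-byte subblocks with one valid bit each, and the function name is hypothetical.&lt;br /&gt;

```python
def apply_remote_write(valid_bits, written_addr, line_base, subblock_size):
    """Clear only the valid bit of the subblock that a remote processor wrote."""
    index = (written_addr - line_base) // subblock_size
    updated = list(valid_bits)
    updated[index] = False
    return updated


LINE_SIZE = 64       # assumed line size in bytes
SUBBLOCK_SIZE = 16   # assumed subblock size in bytes

# The local cache holds the whole line, so all four subblocks start valid.
valid = [True] * (LINE_SIZE // SUBBLOCK_SIZE)

# A remote processor writes byte address 0, which lies in subblock 0.
after = apply_remote_write(valid, written_addr=0, line_base=0,
                           subblock_size=SUBBLOCK_SIZE)
print(after)

# The local processor's own data at byte address 32 (subblock 2) stays valid.
local_addr = 32
print(after[local_addr // SUBBLOCK_SIZE])
```

With whole-line coherence the same remote write would have invalidated the local copy of address 32 as well, producing a false sharing miss on the next access.&lt;br /&gt;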
&lt;br /&gt;
=== Optimizing the Layout of Shared Data in Cache Blocks ===&lt;br /&gt;
&lt;br /&gt;
A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, which decreases the miss rate. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus, in multiprocessors the miss rate can go up or down as the block size increases. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It depicts a significant increase in the miss rate as the block size increases, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
Several data placement optimizations have been suggested to reduce the miss rate due to false sharing. They are listed below.&lt;br /&gt;
&lt;br /&gt;
* '''SplitScalar:''' '''''Place scalar variables that cause false sharing in different blocks.'''''&lt;br /&gt;
&lt;br /&gt;
* '''Heap Allocate:''' '''''Allocate shared space from different heap regions according to which processor requests the space.''''' It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Expand Record:''' '''''Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records.''''' While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
* '''Align Record:''' '''''Choose a layout for arrays of records that minimizes the number of blocks the average record spans.''''' This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Lockscalar:''' '''''Place active scalars that are protected by a lock in the same block as the lock variable.''''' As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
&lt;br /&gt;
=== Compile Time Data Transformations ===&lt;br /&gt;
&lt;br /&gt;
This section introduces the reader to the concepts of compile-time data transformations for reducing false sharing in multiprocessors. Please refer to the paper by Tor E. Jeremiassen and Susan J. Eggers for a detailed explanation of the algorithms.&lt;br /&gt;
&lt;br /&gt;
A series of compiler-directed algorithms and a suite of transformations can be employed to restructure shared data at compile time. These algorithms analyze explicitly parallel programs, producing information about their cross-processor memory reference patterns that identifies data structures susceptible to false sharing, and then choose appropriate transformations to eliminate it.&lt;br /&gt;
&lt;br /&gt;
The compiler analysis comprises three stages:&lt;br /&gt;
&lt;br /&gt;
* Determine the section of code each process executes by computing its control flow graph (per-process control flow analysis).&lt;br /&gt;
* Perform non-concurrency analysis by examining the barrier synchronization pattern of the program, delineating the phases that cannot execute in parallel and computing the flow of control between them.&lt;br /&gt;
* Perform a summary side-effect analysis on a per-process basis for each phase (determined in stage two).&lt;br /&gt;
&lt;br /&gt;
The per-process control flow analysis and summary side-effect analysis (stages 1 and 3, respectively) yield the sections of shared data that each processor reads and writes. The non-concurrency analysis (stage 2) uses synchronization points to determine which portions of the code can execute in parallel and which cannot.&lt;br /&gt;
&lt;br /&gt;
In order to reduce the number of false sharing misses, data must be restructured so that:&lt;br /&gt;
* Data that is only, or overwhelmingly, accessed by one processor is grouped together.&lt;br /&gt;
* Write-shared data objects with no processor locality do not share cache lines.&lt;br /&gt;
&lt;br /&gt;
Two transformations have been devised to achieve the above two conditions.&lt;br /&gt;
* '''group and transpose''' – To address condition 1.&lt;br /&gt;
* '''padding''' – To address condition 2.&lt;br /&gt;
&lt;br /&gt;
'''''Group and transpose''''' physically groups data together by changing the layout of the data structures in memory. If each processor’s data is smaller than the cache block size, it may be padded so that no two processors’ data share a cache block. In addition to avoiding false sharing, this also improves spatial locality.&lt;br /&gt;
&lt;br /&gt;
The second transformation, '''padding''', pads data that is falsely shared in the short term but eventually write-shared by all processors over time.&lt;br /&gt;
&lt;br /&gt;
The speedups achieved with and without compile time data transformations for a few test programs are given below.&lt;br /&gt;
                                  [[Image:Plots.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Diminishing True sharing misses ==&lt;br /&gt;
&lt;br /&gt;
A new technique, called coherence decoupling, breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce, if not eliminate, the effects of false sharing. Furthermore, coherence decoupling can also reduce the latencies incurred by true sharing. The simulations and results can be viewed at the link below.&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.cs.utexas.edu/pub/dburger/papers/ASPLOS04_cd.pdf]&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;br /&gt;
&lt;br /&gt;
[http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/1663/ftp:zSzzSzftp.cs.washington.eduzSztrzSz1994zSz09zSzUW-CSE-94-09-05.pdf/jeremiassen94reducing.pdf]&lt;br /&gt;
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.&lt;br /&gt;
&lt;br /&gt;
Tor E. Jeremiassen and Susan J. Eggers&lt;br /&gt;
&lt;br /&gt;
[https://www.it.uu.se/research/publications/reports/2003-044/2003-044-nc.pdf]&lt;br /&gt;
Cache Memory Behavior of Advanced PDE Solvers&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/26601/http:zSzzSzresearch.cs.tamu.eduzSzncstrlzSzTR95-010.pdf/kadiyala95dynamic.pdf]&lt;br /&gt;
A Dynamic Cache Sub-Block Design to Reduce False Sharing&lt;br /&gt;
&lt;br /&gt;
Murali Kadiyala and Laxmi N Bhuyan&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/573/http:zSzzSzwww.csg.lcs.mit.edu:8001zSzUserszSzvivekzSz.zSzpszSzChSa97.pdf/chow97false.pdf]&lt;br /&gt;
False Sharing Elimination by Selection of Runtime Scheduling Parameters&lt;br /&gt;
&lt;br /&gt;
Jyh-Herng Chow and Vivek Sarkar&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7583</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=7583"/>
		<updated>2007-10-24T23:49:13Z</updated>

		<summary type="html">&lt;p&gt;Kperi: /* ''Introduction'' */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Topic for Discussion'' ==&lt;br /&gt;
&lt;br /&gt;
True and false sharing. In Lectures 9 and 10, we covered performance results for true- and false-sharing misses. The results showed that some applications experienced degradation due to false sharing, and that this problem was greater with larger cache lines. But these data are at least 9 years old, and for multiprocessors that are smaller than those in use today. Comb the ACM Digital Library, IEEE Xplore, and the Web for more up-to-date results. What strategies have proven successful in combating false sharing? Is there any research into ways of diminishing true-sharing misses, e.g., by locating communicating processes on the same processor? Wouldn't this diminish parallelism and thus hurt performance?&lt;br /&gt;
&lt;br /&gt;
== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit the performance of multiprocessors. Cache misses are broadly categorized into the “Three Cs”, namely compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by one processor is used by another processor, it is said to be true sharing. True sharing is intrinsic to a particular memory reference stream of a program and is not dependent on the block size. True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in the program is a critical factor for performance on multiprocessors; high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.&lt;br /&gt;
&lt;br /&gt;
•	'''False Sharing''': When independent data words for different processors are placed in the same block, then it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of the cache line helps in reducing the miss rate, as more words can be accommodated in the same line. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of actual memory (a cache line) containing the requested memory location to be copied into the cache. Subsequent references to the same memory location or those around it can probably be satisfied out of the cache until the system determines it is necessary to maintain coherency between cache and memory and writes the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire cache line is invalidated, even though the updates are independent of each other. Each update of an individual element of a cache line marks the line as invalid in the other processors' caches. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, and not for individual elements. As a result there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. This situation is called false sharing, and it can become a bottleneck for performance and scalability. &lt;br /&gt;
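&lt;br /&gt;
The effect can be illustrated with a small simulation. The sketch below is a toy model, not a real coherence protocol: the 64-byte line size, the function names and the access trace are all assumptions made for illustration. It counts the misses a processor takes on a line it previously held, after another processor wrote to that line.&lt;br /&gt;

```python
# Toy model of invalidation-based coherence tracked per cache line.
# LINE_SIZE and the access trace are illustrative assumptions.
LINE_SIZE = 64  # bytes per cache line (assumed)


def count_coherence_misses(trace, line_size=LINE_SIZE):
    """Count misses caused by another processor's write to the same line.

    trace is a list of (processor, byte_address, is_write) tuples.
    """
    valid = {}      # maps a line number to the processors holding a valid copy
    seen = set()    # (processor, line) pairs that were cached at least once
    misses = 0
    for proc, addr, is_write in trace:
        line = addr // line_size
        holders = valid.setdefault(line, set())
        if proc not in holders:
            # A miss on a line this processor held before is a coherence miss.
            if (proc, line) in seen:
                misses += 1
            holders.add(proc)
            seen.add((proc, line))
        if is_write:
            # A write invalidates every other copy of the whole line.
            valid[line] = {proc}
    return misses


def alternating_writes(addr_p0, addr_p1, rounds):
    """Two processors take turns writing their own private word."""
    trace = []
    for _ in range(rounds):
        trace.append((0, addr_p0, True))
        trace.append((1, addr_p1, True))
    return trace


same_line = count_coherence_misses(alternating_writes(0, 8, 100))
separate_lines = count_coherence_misses(alternating_writes(0, 64, 100))
print(same_line, separate_lines)
```

In this model the two words at byte addresses 0 and 8 fall in the same line and every write after the first round misses, while placing the second word at address 64 removes these misses entirely, although the processors never touch each other's data in either case.&lt;br /&gt;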
&lt;br /&gt;
== ''Strategies to combat False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
Several strategies have been proposed, and are being worked on, to decrease false sharing in multiprocessors. This article introduces a few of the salient ones.&lt;br /&gt;
&lt;br /&gt;
* Reducing False Sharing through Proper Block Sizing&lt;br /&gt;
* Reducing False Sharing through Sectored Caches&lt;br /&gt;
* Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks&lt;br /&gt;
* Reducing False Sharing through Compile Time Data Transformations&lt;br /&gt;
* Dynamic Cache Sub-Block Design to Reduce False Sharing&lt;br /&gt;
* False Sharing Elimination by Selection of Runtime Scheduling Parameters&lt;br /&gt;
&lt;br /&gt;
This article discusses only the first four techniques. Links to material on the last two techniques are provided in the references section below.&lt;br /&gt;
&lt;br /&gt;
=== Proper Block Sizing ===&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size choice varies over the memory space of the program, but any given word is assigned to a specific fixed block size for the entire program execution. Starting with each word in the memory space that is used, neighboring blocks are combined if, when combined, they produce less bus traffic than when left as single blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, the traffic is reduced when the words (or blocks) are grouped into a single unit, due to fewer address transmissions over the bus. When excessive traffic is generated due to false sharing, the problem blocks are isolated by not combining them into larger units. &lt;br /&gt;
&lt;br /&gt;
=== Sectored Caches ===&lt;br /&gt;
&lt;br /&gt;
Sectored caches can be yet another effective strategy to reduce false sharing misses in bus-based multiprocessors. The sectored cache is organized such that each line is divided into basic coherence units, called subblocks. When false sharing occurs, the involved cache line need not be invalidated or transferred, as long as the corresponding subblocks are kept coherent. In simulations, this strategy reduced false sharing misses by about 30-80%. Details of the simulation are explained in the paper linked below.&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/5048/ftp:zSzzSzpads10.cs.nthu.edu.twzSzpubzSzpaperszSzkcliuzSzicpads97.pdf/liu97effectiveness.pdf] Sectored Caches &lt;br /&gt;
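&lt;br /&gt;
A minimal sketch of the subblock idea follows. It is not the protocol from the linked paper: the class name, the 64-byte line and the 16-byte subblock are assumptions chosen only to show how one valid bit per subblock lets a remote write invalidate part of a line.&lt;br /&gt;

```python
# Toy sectored cache line: one tag, one valid bit per subblock.
# The sizes below are assumptions chosen for illustration only.
LINE_SIZE = 64       # bytes per cache line
SUBBLOCK_SIZE = 16   # bytes per coherence unit (subblock)


class SectoredLine:
    def __init__(self):
        count = LINE_SIZE // SUBBLOCK_SIZE
        self.valid = [True] * count   # all subblocks valid after a line fill

    def remote_write(self, offset):
        """Another processor wrote this byte offset: invalidate one subblock."""
        self.valid[offset // SUBBLOCK_SIZE] = False

    def local_access_hits(self, offset):
        """A local access hits only if its own subblock is still valid."""
        return self.valid[offset // SUBBLOCK_SIZE]


line = SectoredLine()
line.remote_write(48)                         # remote update in the last subblock
hit_own_word = line.local_access_hits(0)      # different subblock: still a hit
hit_shared_word = line.local_access_hits(52)  # same subblock: a miss
print(hit_own_word, hit_shared_word)
```

With whole-line invalidation both local accesses would miss; here only the access that falls in the written subblock does.&lt;br /&gt;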
&lt;br /&gt;
=== Optimizing the Layout of Shared Data in Cache Blocks ===&lt;br /&gt;
&lt;br /&gt;
A very important parameter that affects false sharing is the block size of the cache. In uniprocessors, an increase in the block size tends to exploit more spatial locality, due to which the miss rate decreases. In multiprocessors, on the other hand, an increase in block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases in multiprocessors. The graph below shows the variation of miss rate as a function of the block size for 16 and 32 processors. It depicts the significant increase in the miss rate as the block size increases, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
Different data placement optimizations have been suggested to improve the miss rate due to false sharing. They are listed below.&lt;br /&gt;
&lt;br /&gt;
* '''SplitScalar:''' '''''Place scalar variables that cause false sharing in different blocks.'''''&lt;br /&gt;
&lt;br /&gt;
* '''Heap Allocate:''' '''''Allocate shared space from different heap regions according to which processor requests the space.''''' It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Expand Record:''' '''''Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records.''''' While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
* '''Align Record:''' '''''Choose a layout for arrays of records that minimizes the number of blocks the average record spans.''''' This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Lockscalar:''' '''''Place active scalars that are protected by a lock in the same block as the lock variable.''''' As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
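&lt;br /&gt;
As an illustration of the Expand Record optimization listed above, the following sketch computes which cache blocks hold parts of more than one record, before and after padding. The 32-byte block, the 24-byte record and the function names are assumed values, not figures from the cited papers.&lt;br /&gt;

```python
# Sketch of the "Expand Record" idea: pad each record up to a multiple of
# the block size so that no two records share a cache block.
# BLOCK_SIZE and the 24-byte record are assumed values.
BLOCK_SIZE = 32


def padded_size(record_size, block_size=BLOCK_SIZE):
    """Round record_size up to the next multiple of block_size."""
    blocks = -(-record_size // block_size)   # ceiling division
    return blocks * block_size


def shared_blocks(record_size, stride, count, block_size=BLOCK_SIZE):
    """Return the blocks touched by more than one record of an array."""
    owners = {}
    for index in range(count):
        start = index * stride
        last = start + record_size - 1
        for block in range(start // block_size, last // block_size + 1):
            owners.setdefault(block, set()).add(index)
    return sorted(b for b in owners if len(owners[b]) != 1)


unpadded = shared_blocks(record_size=24, stride=24, count=4)
padded = shared_blocks(record_size=24, stride=padded_size(24), count=4)
print(unpadded, padded)
```

The padding costs 8 bytes per record in this example, which is the usual trade-off of this optimization: less false sharing in exchange for a larger memory footprint.&lt;br /&gt;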
&lt;br /&gt;
=== Compile Time Data Transformations ===&lt;br /&gt;
&lt;br /&gt;
This section introduces the reader to the concepts of compile-time data transformations for reducing false sharing in multiprocessors. Please refer to the paper by Tor E. Jeremiassen and Susan J. Eggers for a detailed explanation of the algorithms.&lt;br /&gt;
&lt;br /&gt;
A series of compiler-directed algorithms and a suite of transformations can be employed to restructure the shared data at compile time. These algorithms analyze explicitly parallel programs, producing information about their cross-processor memory reference patterns that identifies data structures susceptible to false sharing, and then choose appropriate transformations to eliminate it.&lt;br /&gt;
&lt;br /&gt;
The compiler analysis comprises three stages:&lt;br /&gt;
&lt;br /&gt;
* Determine the section of code each process executes by computing its control flow graph (per-process control flow analysis).&lt;br /&gt;
* Perform non-concurrency analysis by examining the barrier synchronization pattern of the program, delineating the phases that cannot execute in parallel and computing the flow of control between them.&lt;br /&gt;
* Perform a summary side-effect analysis on a per-process basis for each phase (determined in stage two).&lt;br /&gt;
&lt;br /&gt;
The per-process control flow analysis and summary side-effect analysis (stages 1 and 3 respectively) yield the sections of shared data that each processor reads and writes. The non-concurrency analysis (stage 2) uses synchronization points to determine which portions of the code can execute in parallel and which cannot.&lt;br /&gt;
&lt;br /&gt;
In order to reduce the number of false sharing misses, data must be restructured so that:&lt;br /&gt;
* Data that is only, or overwhelmingly, accessed by one processor is grouped together.&lt;br /&gt;
* Write-shared data objects with no processor locality do not share cache lines.&lt;br /&gt;
&lt;br /&gt;
Two transformations have been devised to achieve the above two conditions. &lt;br /&gt;
* '''group and transpose''' – To address condition 1.&lt;br /&gt;
* '''padding''' – To address condition 2.&lt;br /&gt;
&lt;br /&gt;
'''''Group and transpose''''' physically groups data together by changing the layout of the data structures in memory. If each processor’s data is smaller than the cache block size, it may be padded so that no two processors’ data share a cache block. In addition to avoiding false sharing, this also improves spatial locality.&lt;br /&gt;
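&lt;br /&gt;
The layout change behind group and transpose can be sketched as an index remapping. The example below is a hypothetical illustration, not the compiler algorithm from the paper: it assumes 4 processors, 8 words per processor and 8 words per cache block, and counts the blocks that hold words belonging to more than one processor.&lt;br /&gt;

```python
# Sketch of "group and transpose" on a vector accessed as data[i][p],
# where p is the owning processor. Sizes are illustrative assumptions.
WORDS_PER_BLOCK = 8
NUM_PROCS = 4
ITEMS_PER_PROC = 8


def interleaved_index(item, proc):
    """Original layout: words of different processors are neighbours."""
    return item * NUM_PROCS + proc


def transposed_index(item, proc):
    """Transformed layout: each processor's words are contiguous."""
    return proc * ITEMS_PER_PROC + item


def blocks_with_mixed_owners(index_of):
    """Count cache blocks that hold words of more than one processor."""
    owners = {}
    for proc in range(NUM_PROCS):
        for item in range(ITEMS_PER_PROC):
            block = index_of(item, proc) // WORDS_PER_BLOCK
            owners.setdefault(block, set()).add(proc)
    return sum(1 for procs in owners.values() if len(procs) != 1)


before = blocks_with_mixed_owners(interleaved_index)
after = blocks_with_mixed_owners(transposed_index)
print(before, after)
```

In the interleaved layout every block mixes words from all four processors; after the transpose each block belongs to exactly one processor, which is also why spatial locality improves.&lt;br /&gt;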
&lt;br /&gt;
The second transformation, '''padding''', pads data that is falsely shared in the short term but eventually write-shared by all processors over time.&lt;br /&gt;
&lt;br /&gt;
The speedups achieved with and without compile time data transformations for a few test programs are given below.&lt;br /&gt;
                                  [[Image:Plots.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Diminishing True sharing misses ==&lt;br /&gt;
&lt;br /&gt;
A new technique, called coherence decoupling, breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce, if not eliminate, the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. The simulations and results can be viewed at the link below.&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.cs.utexas.edu/pub/dburger/papers/ASPLOS04_cd.pdf]&lt;br /&gt;
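&lt;br /&gt;
The verify-or-replay behaviour of coherence decoupling can be sketched in software as follows. This is only a sequential analogy with hypothetical function names: in hardware the coherent fetch overlaps with the speculative computation, which is where the latency saving comes from.&lt;br /&gt;

```python
# Sequential sketch of the verify-or-replay idea behind coherence decoupling.
# In hardware the coherent fetch overlaps with the speculative computation;
# here the two steps simply run one after the other.


def speculative_load(stale_value, fetch_coherent, compute):
    """Compute on a stale value, then verify it against the coherent one.

    Returns (result, speculation_was_correct).
    """
    speculative_result = compute(stale_value)   # proceed with incoherent data
    coherent_value = fetch_coherent()           # backing coherence protocol
    if coherent_value == stale_value:
        return speculative_result, True         # speculation verified
    return compute(coherent_value), False       # mis-speculation: recompute


def double(x):
    return 2 * x


# False sharing: the word this processor reads was not changed remotely.
falsely_shared = speculative_load(7, lambda: 7, double)
# True sharing: the word itself was changed, so the result is recomputed.
truly_shared = speculative_load(7, lambda: 9, double)
print(falsely_shared, truly_shared)
```

Under false sharing the stale word is still correct, so the speculation always verifies; under true sharing it verifies only when the remote write happened to leave the value unchanged.&lt;br /&gt;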
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;br /&gt;
&lt;br /&gt;
[http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/1663/ftp:zSzzSzftp.cs.washington.eduzSztrzSz1994zSz09zSzUW-CSE-94-09-05.pdf/jeremiassen94reducing.pdf]&lt;br /&gt;
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.&lt;br /&gt;
&lt;br /&gt;
Tor E. Jeremiassen and Susan J. Eggers&lt;br /&gt;
&lt;br /&gt;
[https://www.it.uu.se/research/publications/reports/2003-044/2003-044-nc.pdf]&lt;br /&gt;
Cache Memory Behavior of Advanced PDE Solvers&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/26601/http:zSzzSzresearch.cs.tamu.eduzSzncstrlzSzTR95-010.pdf/kadiyala95dynamic.pdf]&lt;br /&gt;
A Dynamic Cache Sub-Block Design to Reduce False Sharing&lt;br /&gt;
&lt;br /&gt;
Murali Kadiyala and Laxmi N Bhuyan&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/573/http:zSzzSzwww.csg.lcs.mit.edu:8001zSzUserszSzvivekzSz.zSzpszSzChSa97.pdf/chow97false.pdf]&lt;br /&gt;
False Sharing Elimination by Selection of Runtime Scheduling Parameters&lt;br /&gt;
&lt;br /&gt;
Jyh-Herng Chow and Vivek Sarkar&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=6034</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=6034"/>
		<updated>2007-10-19T23:47:12Z</updated>

		<summary type="html">&lt;p&gt;Kperi: /* Compile Time Data Transformations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Topic for Discussion'' ==&lt;br /&gt;
&lt;br /&gt;
True and false sharing. In Lectures 9 and 10, we covered performance results for true- and false-sharing misses. The results showed that some applications experienced degradation due to false sharing, and that this problem was greater with larger cache lines. But these data are at least 9 years old, and for multiprocessors that are smaller than those in use today. Comb the ACM Digital Library, IEEE Xplore, and the Web for more up-to-date results. What strategies have proven successful in combating false sharing? Is there any research into ways of diminishing true-sharing misses, e.g., by locating communicating processes on the same processor? Wouldn't this diminish parallelism and thus hurt performance?&lt;br /&gt;
&lt;br /&gt;
== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit the performance of multiprocessors. Cache misses are broadly categorized into the “Three Cs”, namely compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by a processor is used by another processor, then it is said to be True Sharing. True sharing is intrinsic to a particular memory reference stream of a program and is not dependent on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False Sharing''': When independent data words for different processors are placed in the same block, then it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of the cache line helps in reducing the miss rate, as more words can be accommodated in the same line. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of actual memory (a cache line) containing the requested memory location to be copied into the cache. Subsequent references to the same memory location or those around it can probably be satisfied out of the cache until the system determines it is necessary to maintain coherency between cache and memory and writes the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire cache line is invalidated, even though the updates are independent of each other. Each update of an individual element of a cache line marks the line as invalid in the other processors' caches. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, and not for individual elements. As a result there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. This situation is called false sharing, and it can become a bottleneck for performance and scalability. &lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
Several strategies have been proposed, and are being worked on, to decrease false sharing in multiprocessors. This article introduces a few of the salient ones.&lt;br /&gt;
&lt;br /&gt;
* Reducing False Sharing through Proper Block Sizing&lt;br /&gt;
* Reducing False Sharing through Sectored Caches&lt;br /&gt;
* Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks&lt;br /&gt;
* Reducing False Sharing through Compile Time Data Transformations&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Proper Block Sizing ===&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size choice varies over the memory space of the program, but any given word is assigned to a specific fixed block size for the entire program execution. Starting with each word in the memory space that is used, neighboring blocks are combined if, when combined, they produce less bus traffic than when left as single blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, the traffic is reduced when the words (or blocks) are grouped into a single unit, due to fewer address transmissions over the bus. When excessive traffic is generated due to false sharing, the problem blocks are isolated by not combining them into larger units. &lt;br /&gt;
&lt;br /&gt;
=== Sectored Caches ===&lt;br /&gt;
&lt;br /&gt;
Sectored caches can be yet another effective strategy to reduce false sharing misses in bus-based multiprocessors. The sectored cache is organized such that each line is divided into basic coherence units, called subblocks. When false sharing occurs, the involved cache line need not be invalidated or transferred, as long as the corresponding subblocks are kept coherent. In simulations, this strategy reduced false sharing misses by about 30-80%. Details of the simulation are explained in the paper linked below.&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/5048/ftp:zSzzSzpads10.cs.nthu.edu.twzSzpubzSzpaperszSzkcliuzSzicpads97.pdf/liu97effectiveness.pdf] Sectored Caches &lt;br /&gt;
&lt;br /&gt;
=== Optimizing the Layout of Shared Data in Cache Blocks ===&lt;br /&gt;
&lt;br /&gt;
A very important parameter that affects false sharing is the block size of the cache. In uniprocessors, an increase in the block size tends to exploit more spatial locality, due to which the miss rate decreases. In multiprocessors, on the other hand, an increase in block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases in multiprocessors. The graph below shows the variation of miss rate as a function of the block size for 16 and 32 processors. It depicts the significant increase in the miss rate as the block size increases, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
Different data placement optimizations have been suggested to improve the miss rate due to false sharing. They are listed below.&lt;br /&gt;
&lt;br /&gt;
* '''SplitScalar:''' '''''Place scalar variables that cause false sharing in different blocks.'''''&lt;br /&gt;
&lt;br /&gt;
* '''Heap Allocate:''' '''''Allocate shared space from different heap regions according to which processor requests the space.''''' It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Expand Record:''' '''''Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records.''''' While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
* '''Align Record:''' '''''Choose a layout for arrays of records that minimizes the number of blocks the average record spans.''''' This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Lockscalar:''' '''''Place active scalars that are protected by a lock in the same block as the lock variable.''''' As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
&lt;br /&gt;
=== Compile Time Data Transformations ===&lt;br /&gt;
&lt;br /&gt;
This section introduces the reader to the concepts of compile-time data transformations for reducing false sharing in multiprocessors. Please refer to the paper by Tor E. Jeremiassen and Susan J. Eggers for a detailed explanation of the algorithms.&lt;br /&gt;
&lt;br /&gt;
A series of compiler-directed algorithms and a suite of transformations can be employed to restructure the shared data at compile time. These algorithms analyze explicitly parallel programs, producing information about their cross-processor memory reference patterns that identifies data structures susceptible to false sharing, and then choose appropriate transformations to eliminate it.&lt;br /&gt;
&lt;br /&gt;
The compiler analysis comprises three stages:&lt;br /&gt;
&lt;br /&gt;
* Determine the section of code each process executes by computing its control flow graph (per-process control flow analysis).&lt;br /&gt;
* Perform non-concurrency analysis by examining the barrier synchronization pattern of the program, delineating the phases that cannot execute in parallel and computing the flow of control between them.&lt;br /&gt;
* Perform a summary side-effect analysis on a per-process basis for each phase (determined in stage two).&lt;br /&gt;
&lt;br /&gt;
The per-process control flow analysis and summary side-effect analysis (stages 1 and 3 respectively) yield the sections of shared data that each processor reads and writes. The non-concurrency analysis (stage 2) uses synchronization points to determine which portions of the code can execute in parallel and which cannot.&lt;br /&gt;
&lt;br /&gt;
In order to reduce the number of false sharing misses, data must be restructured so that:&lt;br /&gt;
* Data that is only, or overwhelmingly, accessed by one processor is grouped together.&lt;br /&gt;
* Write-shared data objects with no processor locality do not share cache lines.&lt;br /&gt;
&lt;br /&gt;
Two transformations have been devised to achieve the above two conditions. &lt;br /&gt;
* '''group and transpose''' – To address condition 1.&lt;br /&gt;
* '''padding''' – To address condition 2.&lt;br /&gt;
&lt;br /&gt;
'''''Group and transpose''''' physically groups data together by changing the layout of the data structures in memory. If each processor’s data is smaller than the cache block size, it may be padded so that no two processors’ data share a cache block. In addition to avoiding false sharing, this also improves spatial locality.&lt;br /&gt;
&lt;br /&gt;
The second transformation, '''padding''', pads data that is falsely shared in the short term but eventually write-shared by all processors over time.&lt;br /&gt;
&lt;br /&gt;
The speedups achieved with and without compile time data transformations for a few test programs are given below.&lt;br /&gt;
                                  [[Image:Plots.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Diminishing True sharing misses ==&lt;br /&gt;
&lt;br /&gt;
A new technique, called coherence decoupling, breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce, if not eliminate, the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. The simulations and results can be viewed at the link below.&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.cs.utexas.edu/pub/dburger/papers/ASPLOS04_cd.pdf]&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;br /&gt;
&lt;br /&gt;
[http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/1663/ftp:zSzzSzftp.cs.washington.eduzSztrzSz1994zSz09zSzUW-CSE-94-09-05.pdf/jeremiassen94reducing.pdf]&lt;br /&gt;
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.&lt;br /&gt;
Tor E. Jeremiassen and Susan J. Eggers&lt;br /&gt;
&lt;br /&gt;
[https://www.it.uu.se/research/publications/reports/2003-044/2003-044-nc.pdf]&lt;br /&gt;
Cache Memory Behavior of Advanced PDE Solvers&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=6033</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=6033"/>
		<updated>2007-10-19T23:45:46Z</updated>

		<summary type="html">&lt;p&gt;Kperi: /* ''Strategies to combat False Sharing'' */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Topic for Discussion'' ==&lt;br /&gt;
&lt;br /&gt;
True and false sharing. In Lectures 9 and 10, we covered performance results for true- and false-sharing misses. The results showed that some applications experienced degradation due to false sharing, and that this problem was greater with larger cache lines. But these data are at least 9 years old, and for multiprocessors that are smaller than those in use today. Comb the ACM Digital Library, IEEE Xplore, and the Web for more up-to-date results. What strategies have proven successful in combating false sharing? Is there any research into ways of diminishing true-sharing misses, e.g., by locating communicating processes on the same processor? Wouldn't this diminish parallelism and thus hurt performance?&lt;br /&gt;
&lt;br /&gt;
== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate tends to increase as well. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses, and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by a processor is used by another processor, then it is said to be True Sharing. True sharing is intrinsic to a particular memory reference stream of a program and is not dependent on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False Sharing''': When independent data words for different processors are placed in the same block, then it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate for programs with good spatial locality, as more neighboring words are brought in with each line. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to nearby locations can usually be satisfied from the cache until the system determines that it must restore coherence between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire line is invalidated even though the updates are independent of each other. Each update of an individual element marks the line as invalid in the other processors' caches. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line, even though the element they access has not been modified. This is because cache coherence is maintained on a cache-line basis, not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. This situation is called false sharing, and it can become a bottleneck for performance and scalability.&lt;br /&gt;
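&lt;br /&gt;
The invalidation behaviour described above can be illustrated with a small simulation. The following Python sketch is not taken from the cited papers; it assumes a simple write-invalidate protocol, a 64-byte line, and caches of unlimited capacity, and the helper names (count_coherence_misses, ping_pong) are hypothetical.&lt;br /&gt;

```python
LINE_SIZE = 64  # bytes per cache line (assumed value)


def count_coherence_misses(accesses, line_size=LINE_SIZE):
    """Count misses caused by invalidations in a write-invalidate protocol.

    accesses is a list of (processor, byte_address, is_write) tuples.
    Coherence is tracked per line: a write removes the line from every
    other processor's cache. Caches never evict, so every miss on a line
    that was cached before is a coherence miss.
    """
    holders = {}   # line number -> processors holding a valid copy
    seen = set()   # (processor, line) pairs that were cached at least once
    misses = 0
    for proc, addr, is_write in accesses:
        line = addr // line_size
        valid = holders.setdefault(line, set())
        if proc not in valid:
            if (proc, line) in seen:
                misses += 1          # the copy was lost to an invalidation
            seen.add((proc, line))
            valid.add(proc)
        if is_write:
            holders[line] = {proc}   # invalidate all other copies
    return misses


def ping_pong(addr0, addr1, rounds=100):
    """Processors 0 and 1 alternately write their own private word."""
    trace = []
    for _ in range(rounds):
        trace.append((0, addr0, True))
        trace.append((1, addr1, True))
    return trace


same_line = count_coherence_misses(ping_pong(0, 8))         # words share a line
different_lines = count_coherence_misses(ping_pong(0, 64))  # words 64 bytes apart
print(same_line, different_lines)  # prints: 198 0
```

Under these assumptions, each processor loses its copy on every round when the two words share a line, and never when the words fall in different lines, although no data is actually communicated in either case.&lt;br /&gt;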
&lt;br /&gt;
== ''Strategies to combat False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
Several strategies have been proposed, and are still being worked on, to decrease false sharing in multiprocessors. This article introduces a few of the salient ones.&lt;br /&gt;
&lt;br /&gt;
* Reducing False Sharing through Proper Block Sizing&lt;br /&gt;
* Reducing False Sharing through Sectored Caches&lt;br /&gt;
* Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks&lt;br /&gt;
* Reducing False Sharing through Compile Time Data Transformations&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Proper Block Sizing ===&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to a specific fixed block size for the entire program execution. Starting with each word in the memory space that is used, neighboring blocks are combined if, when combined, they produce less bus traffic than when left as separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When excessive traffic is generated due to false sharing, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
=== Sectored Caches ===&lt;br /&gt;
&lt;br /&gt;
Sectored caches can be another effective strategy for reducing false sharing misses in bus-based multiprocessors. A sectored cache is organized such that each line is divided into basic coherence units, called subblocks. When false sharing occurs, the involved cache line need not be invalidated or transferred, as long as the corresponding subblocks are kept coherent. In simulations, this strategy reduced false sharing misses by about 30-80%. Details of the simulations are given in the paper linked below.&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/5048/ftp:zSzzSzpads10.cs.nthu.edu.twzSzpubzSzpaperszSzkcliuzSzicpads97.pdf/liu97effectiveness.pdf] Sectored Caches &lt;br /&gt;
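&lt;br /&gt;
The effect of keeping coherence per subblock rather than per line can be sketched as follows. This is a simplified Python model, not the simulator used in the linked paper; the 64-byte line, the 16-byte subblock, the two-processor access pattern, and the names SectoredCache and run are all assumptions made for illustration.&lt;br /&gt;

```python
LINE_SIZE = 64      # bytes per cache line (assumed value)
SUBBLOCK_SIZE = 16  # bytes per coherence unit in the sectored cache (assumed value)


class SectoredCache:
    """One processor's cache, tracking a valid bit per coherence unit."""

    def __init__(self, unit_size):
        self.unit_size = unit_size
        self.valid = set()  # numbers of the coherence units that are valid

    def access(self, addr):
        """Return True on a hit; on a miss, fetch the unit and return False."""
        unit = addr // self.unit_size
        hit = unit in self.valid
        self.valid.add(unit)
        return hit

    def invalidate(self, addr):
        """Drop the coherence unit containing addr, if present."""
        self.valid.discard(addr // self.unit_size)


def run(unit_size, rounds=50):
    """Two processors alternately write private words in the same line."""
    caches = [SectoredCache(unit_size), SectoredCache(unit_size)]
    private_addr = [0, 32]  # both addresses lie in the same 64-byte line
    misses = 0
    for _ in range(rounds):
        for proc in (0, 1):
            if not caches[proc].access(private_addr[proc]):
                misses += 1
            # A write invalidates the matching unit in the other cache.
            caches[1 - proc].invalidate(private_addr[proc])
    return misses


print(run(LINE_SIZE), run(SUBBLOCK_SIZE))  # prints: 100 2
```

With the whole line as the coherence unit, every write in this model misses; with 16-byte subblocks, only the two cold misses remain, because the two words fall in different subblocks.&lt;br /&gt;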
&lt;br /&gt;
=== Optimizing the Layout of Shared Data in Cache Blocks ===&lt;br /&gt;
&lt;br /&gt;
A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, which lowers the miss rate. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases in multiprocessors. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It shows a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
Different data placement optimizations have been suggested to improve the miss rate due to false sharing. They are listed below.&lt;br /&gt;
&lt;br /&gt;
* '''SplitScalar:''' '''''Place scalar variables that cause false sharing in different blocks.'''''&lt;br /&gt;
&lt;br /&gt;
* '''Heap Allocate:''' '''''Allocate shared space from different heap regions according to which processor requests the space.''''' It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Expand Record:''' '''''Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records.''''' While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
* ''' Align Record:''' '''''Choose a layout for arrays of records that minimizes the number of blocks the average record spans.''''' This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Lockscalar:''' '''''Place active scalars that are protected by a lock in the same block as the lock variable.''''' As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
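&lt;br /&gt;
The Expand Record optimization listed above can be made concrete with a short calculation. The Python sketch below assumes a 64-byte cache block and an array that starts on a block boundary; the function names are hypothetical and the record sizes are chosen only for illustration.&lt;br /&gt;

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed value)


def padded_size(record_size, block_size=BLOCK_SIZE):
    """Round record_size up to the next multiple of block_size."""
    return ((record_size + block_size - 1) // block_size) * block_size


def blocks_shared_by_records(record_size, count, block_size=BLOCK_SIZE):
    """Return how many cache blocks hold bytes of more than one record.

    The array is assumed to start at a block-aligned address.
    """
    owners = {}  # block number -> indices of the records stored in it
    for i in range(count):
        start = i * record_size
        first = start // block_size
        last = (start + record_size - 1) // block_size
        for block in range(first, last + 1):
            owners.setdefault(block, set()).add(i)
    return sum(1 for records in owners.values() if len(records) != 1)


unpadded = blocks_shared_by_records(24, 8)             # 24-byte records
padded = blocks_shared_by_records(padded_size(24), 8)  # records padded to 64 bytes
print(unpadded, padded)  # prints: 3 0
```

In this example, padding removes all block sharing between records at the cost of growing the array from 192 to 512 bytes, which is the trade-off between false sharing and cache footprint that the optimization has to weigh.&lt;br /&gt;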
&lt;br /&gt;
=== Compile Time Data Transformations ===&lt;br /&gt;
&lt;br /&gt;
This section introduces the concepts behind compile-time data transformations that reduce false sharing in multiprocessors. Please refer to the paper by Tor E. Jeremiassen and Susan J. Eggers for a detailed explanation of the algorithms.&lt;br /&gt;
&lt;br /&gt;
A series of compiler-directed algorithms and a suite of transformations can be employed to restructure shared data at compile time. These algorithms analyze explicitly parallel programs, producing information about their cross-processor memory reference patterns that identifies data structures susceptible to false sharing, and then choose appropriate transformations to eliminate it.&lt;br /&gt;
&lt;br /&gt;
The compiler analysis comprises three stages:&lt;br /&gt;
&lt;br /&gt;
* Determine the section of code each process executes by computing its control flow graph (per-process control flow analysis).&lt;br /&gt;
* Perform non-concurrency analysis by examining the barrier synchronization pattern of the program, delineating the phases that cannot execute in parallel and computing the flow of control between them.&lt;br /&gt;
* Perform a summary side-effect analysis on a per-process basis for each phase (determined in stage two).&lt;br /&gt;
&lt;br /&gt;
The per-process control flow analysis and summary side-effect analysis (stages 1 and 3, respectively) yield the sections of shared data that each processor reads and writes. The non-concurrency analysis (stage 2) uses synchronization points to determine which portions of the code can execute in parallel and which cannot.&lt;br /&gt;
&lt;br /&gt;
In order to reduce the number of false sharing misses, data must be restructured so that:&lt;br /&gt;
* Data that is only, or overwhelmingly, accessed by one processor is grouped together.&lt;br /&gt;
* Write shared data objects with no processor locality do not share cache lines.&lt;br /&gt;
&lt;br /&gt;
Two transformations have been devised to achieve the above two conditions.&lt;br /&gt;
* '''''group and transpose''''' – To address condition 1.&lt;br /&gt;
* '''''padding''''' – To address condition 2.&lt;br /&gt;
&lt;br /&gt;
'''''Group and transpose''''' physically groups data together by changing the layout of the data structures in memory. If each processor’s data is smaller than the cache block size, it may be padded so that no two processors’ data share a cache block. In addition to avoiding false sharing, this also improves spatial locality.&lt;br /&gt;
&lt;br /&gt;
The second transformation, '''''padding''''', pads data that is falsely shared in the short term but eventually write-shared by all processors over time.&lt;br /&gt;
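&lt;br /&gt;
The layout change performed by group and transpose can be sketched with address arithmetic. The Python example below is an illustration under stated assumptions, not the compiler algorithm from the paper: it assumes several vectors indexed by processor id, 8-byte elements, a 64-byte block, and block-aligned storage, and all function names are hypothetical.&lt;br /&gt;

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed value)
WORD_SIZE = 8    # bytes per element (assumed value)


def round_up(n, multiple):
    """Round n up to the next multiple of the given value."""
    return ((n + multiple - 1) // multiple) * multiple


def original_offset(vector, proc, num_procs):
    """Before: each vector is contiguous and element proc belongs to processor proc."""
    return (vector * num_procs + proc) * WORD_SIZE


def transposed_offset(vector, proc, num_vectors):
    """After: each processor's elements are contiguous and padded to whole blocks."""
    stride = round_up(num_vectors * WORD_SIZE, BLOCK_SIZE)
    return proc * stride + vector * WORD_SIZE


def max_sharers(offset_of, num_vectors, num_procs):
    """Largest number of processors whose data lands in one cache block."""
    owners = {}  # block number -> processors owning data in that block
    for v in range(num_vectors):
        for p in range(num_procs):
            owners.setdefault(offset_of(v, p) // BLOCK_SIZE, set()).add(p)
    return max(len(procs) for procs in owners.values())


before = max_sharers(lambda v, p: original_offset(v, p, 4), 3, 4)
after = max_sharers(lambda v, p: transposed_offset(v, p, 3), 3, 4)
print(before, after)  # prints: 4 1
```

For three vectors and four processors, every block in the original layout holds data owned by all four processors, while after grouping and padding each block holds the data of exactly one processor.&lt;br /&gt;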
&lt;br /&gt;
The speedups achieved with and without compile time data transformations for a few test programs are given below.&lt;br /&gt;
                                  [[Image:Plots.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Diminishing True sharing misses ==&lt;br /&gt;
&lt;br /&gt;
A new technique, called coherence decoupling, breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce, if not eliminate, the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. The simulations and results can be viewed at the link below.&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.cs.utexas.edu/pub/dburger/papers/ASPLOS04_cd.pdf]&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;br /&gt;
&lt;br /&gt;
[http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/1663/ftp:zSzzSzftp.cs.washington.eduzSztrzSz1994zSz09zSzUW-CSE-94-09-05.pdf/jeremiassen94reducing.pdf]&lt;br /&gt;
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.&lt;br /&gt;
Tor E. Jeremiassen and Susan J. Eggers&lt;br /&gt;
&lt;br /&gt;
[https://www.it.uu.se/research/publications/reports/2003-044/2003-044-nc.pdf]&lt;br /&gt;
Cache Memory Behavior of Advanced PDE Solvers&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5883</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5883"/>
		<updated>2007-10-18T01:06:46Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Topic for Discussion'' ==&lt;br /&gt;
&lt;br /&gt;
True and false sharing. In Lectures 9 and 10, we covered performance results for true- and false-sharing misses. The results showed that some applications experienced degradation due to false sharing, and that this problem was greater with larger cache lines. But these data are at least 9 years old, and for multiprocessors that are smaller than those in use today. Comb the ACM Digital Library, IEEE Xplore, and the Web for more up-to-date results. What strategies have proven successful in combating false sharing? Is there any research into ways of diminishing true-sharing misses, e.g., by locating communicating processes on the same processor? Wouldn't this diminish parallelism and thus hurt performance?&lt;br /&gt;
&lt;br /&gt;
== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate tends to increase as well. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses, and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by a processor is used by another processor, then it is said to be True Sharing. True sharing is intrinsic to a particular memory reference stream of a program and is not dependent on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False Sharing''': When independent data words for different processors are placed in the same block, then it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate for programs with good spatial locality, as more neighboring words are brought in with each line. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to nearby locations can usually be satisfied from the cache until the system determines that it must restore coherence between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire line is invalidated even though the updates are independent of each other. Each update of an individual element marks the line as invalid in the other processors' caches. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line, even though the element they access has not been modified. This is because cache coherence is maintained on a cache-line basis, not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. This situation is called false sharing, and it can become a bottleneck for performance and scalability.&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
Several strategies have been proposed, and are still being worked on, to decrease false sharing in multiprocessors. This article introduces a few of the salient ones.&lt;br /&gt;
&lt;br /&gt;
* Reducing False Sharing through Proper Block Sizing&lt;br /&gt;
* Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks&lt;br /&gt;
* Reducing False Sharing through Compile Time Data Transformations&lt;br /&gt;
* Reducing False Sharing through Sectored Caches&lt;br /&gt;
&lt;br /&gt;
=== Proper Block Sizing ===&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to a specific fixed block size for the entire program execution. Starting with each word in the memory space that is used, neighboring blocks are combined if, when combined, they produce less bus traffic than when left as separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When excessive traffic is generated due to false sharing, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
=== Sectored Caches ===&lt;br /&gt;
&lt;br /&gt;
Sectored caches can be another effective strategy for reducing false sharing misses in bus-based multiprocessors. A sectored cache is organized such that each line is divided into basic coherence units, called subblocks. When false sharing occurs, the involved cache line need not be invalidated or transferred, as long as the corresponding subblocks are kept coherent. In simulations, this strategy reduced false sharing misses by about 30-80%. Details of the simulations are given in the paper linked below.&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/5048/ftp:zSzzSzpads10.cs.nthu.edu.twzSzpubzSzpaperszSzkcliuzSzicpads97.pdf/liu97effectiveness.pdf] Sectored Caches &lt;br /&gt;
&lt;br /&gt;
=== Optimizing the Layout of Shared Data in Cache Blocks ===&lt;br /&gt;
&lt;br /&gt;
A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, which lowers the miss rate. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases in multiprocessors. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It shows a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
Different data placement optimizations have been suggested to improve the miss rate due to false sharing. They are listed below.&lt;br /&gt;
&lt;br /&gt;
* '''SplitScalar:''' '''''Place scalar variables that cause false sharing in different blocks.'''''&lt;br /&gt;
&lt;br /&gt;
* '''Heap Allocate:''' '''''Allocate shared space from different heap regions according to which processor requests the space.''''' It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Expand Record:''' '''''Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records.''''' While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
* ''' Align Record:''' '''''Choose a layout for arrays of records that minimizes the number of blocks the average record spans.''''' This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Lockscalar:''' '''''Place active scalars that are protected by a lock in the same block as the lock variable.''''' As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
&lt;br /&gt;
=== Compile Time Data Transformations ===&lt;br /&gt;
&lt;br /&gt;
This section introduces the concepts behind compile-time data transformations that reduce false sharing in multiprocessors. Please refer to the paper by Tor E. Jeremiassen and Susan J. Eggers for a detailed explanation of the algorithms.&lt;br /&gt;
&lt;br /&gt;
A series of compiler-directed algorithms and a suite of transformations can be employed to restructure shared data at compile time. These algorithms analyze explicitly parallel programs, producing information about their cross-processor memory reference patterns that identifies data structures susceptible to false sharing, and then choose appropriate transformations to eliminate it.&lt;br /&gt;
&lt;br /&gt;
The compiler analysis comprises three stages:&lt;br /&gt;
&lt;br /&gt;
* Determine the section of code each process executes by computing its control flow graph (per-process control flow analysis).&lt;br /&gt;
* Perform non-concurrency analysis by examining the barrier synchronization pattern of the program, delineating the phases that cannot execute in parallel and computing the flow of control between them.&lt;br /&gt;
* Perform a summary side-effect analysis on a per-process basis for each phase (determined in stage two).&lt;br /&gt;
&lt;br /&gt;
The per-process control flow analysis and summary side-effect analysis (stages 1 and 3, respectively) yield the sections of shared data that each processor reads and writes. The non-concurrency analysis (stage 2) uses synchronization points to determine which portions of the code can execute in parallel and which cannot.&lt;br /&gt;
&lt;br /&gt;
In order to reduce the number of false sharing misses, data must be restructured so that:&lt;br /&gt;
* Data that is only, or overwhelmingly, accessed by one processor is grouped together.&lt;br /&gt;
* Write shared data objects with no processor locality do not share cache lines.&lt;br /&gt;
&lt;br /&gt;
Two transformations have been devised to achieve the above two conditions.&lt;br /&gt;
* '''''group and transpose''''' – To address condition 1.&lt;br /&gt;
* '''''padding''''' – To address condition 2.&lt;br /&gt;
&lt;br /&gt;
'''''Group and transpose''''' physically groups data together by changing the layout of the data structures in memory. If each processor’s data is smaller than the cache block size, it may be padded so that no two processors’ data share a cache block. In addition to avoiding false sharing, this also improves spatial locality.&lt;br /&gt;
&lt;br /&gt;
The second transformation, '''''padding''''', pads data that is falsely shared in the short term but eventually write-shared by all processors over time.&lt;br /&gt;
&lt;br /&gt;
The speedups achieved with and without compile time data transformations for a few test programs are given below.&lt;br /&gt;
                                  [[Image:Plots.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Diminishing True sharing misses ==&lt;br /&gt;
&lt;br /&gt;
A new technique, called coherence decoupling, breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce, if not eliminate, the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. The simulations and results can be viewed at the link below.&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.cs.utexas.edu/pub/dburger/papers/ASPLOS04_cd.pdf]&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;br /&gt;
&lt;br /&gt;
[http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/1663/ftp:zSzzSzftp.cs.washington.eduzSztrzSz1994zSz09zSzUW-CSE-94-09-05.pdf/jeremiassen94reducing.pdf]&lt;br /&gt;
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.&lt;br /&gt;
Tor E. Jeremiassen and Susan J. Eggers&lt;br /&gt;
&lt;br /&gt;
[https://www.it.uu.se/research/publications/reports/2003-044/2003-044-nc.pdf]&lt;br /&gt;
Cache Memory Behavior of Advanced PDE Solvers&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5880</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5880"/>
		<updated>2007-10-18T01:00:31Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate tends to increase as well. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses, and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by a processor is used by another processor, then it is said to be True Sharing. True sharing is intrinsic to a particular memory reference stream of a program and is not dependent on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False Sharing''': When independent data words for different processors are placed in the same block, then it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate for programs with good spatial locality, as more neighboring words are brought in with each line. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to nearby locations can usually be satisfied from the cache until the system determines that it must restore coherence between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire line is invalidated even though the updates are independent of each other. Each update of an individual element marks the line as invalid in the other processors' caches. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line, even though the element they access has not been modified. This is because cache coherence is maintained on a cache-line basis, not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. This situation is called false sharing, and it can become a bottleneck for performance and scalability.&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
Several strategies have been proposed, and are still being worked on, to decrease false sharing in multiprocessors. This article introduces a few of the salient ones.&lt;br /&gt;
&lt;br /&gt;
* Reducing False Sharing through Proper Block Sizing&lt;br /&gt;
* Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks&lt;br /&gt;
* Reducing False Sharing through Compile Time Data Transformations&lt;br /&gt;
* Reducing False Sharing through Sectored Caches&lt;br /&gt;
&lt;br /&gt;
=== Proper Block Sizing ===&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to a block of one fixed size for the entire program execution. Starting with each word in the memory space that is used, neighboring blocks are combined if, when combined, they produce less bus traffic than when left as separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When false sharing generates excessive traffic, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
=== Sectored Caches ===&lt;br /&gt;
&lt;br /&gt;
Sectored caches are another effective strategy for reducing false sharing misses in bus-based multiprocessors. A sectored cache is organized so that each line is divided into basic coherence units called subblocks. When false sharing occurs, the cache line involved need not be invalidated or transferred, as long as the corresponding subblocks are kept coherent. In simulations, this strategy reduced false sharing misses by about 30-80%. Details of the simulations are given in the paper linked below.&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/5048/ftp:zSzzSzpads10.cs.nthu.edu.twzSzpubzSzpaperszSzkcliuzSzicpads97.pdf/liu97effectiveness.pdf] Sectored Caches &lt;br /&gt;
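&lt;br /&gt;
The subblock idea described above can be sketched as follows. This is a simplified illustration and not the design evaluated in the linked paper; the class name SectoredLine, the 64-byte line, and the 16-byte subblock are assumptions. Validity is tracked per subblock, so a write by another processor invalidates only the subblock it falls in.&lt;br /&gt;

```python
class SectoredLine:
    """One cache line whose validity is tracked per subblock (illustrative)."""

    def __init__(self, line_size=64, subblock_size=16):
        # Sizes are assumed values, not taken from the source.
        self.subblock_size = subblock_size
        self.valid = [True] * (line_size // subblock_size)

    def remote_write(self, offset):
        """Another processor wrote the byte at offset: invalidate that subblock only."""
        self.valid[offset // self.subblock_size] = False

    def local_read_hits(self, offset):
        """A local read hits if the subblock holding offset is still valid."""
        return self.valid[offset // self.subblock_size]


line = SectoredLine()
line.remote_write(40)             # byte 40 lies in subblock 2 (bytes 32 to 47)
print(line.local_read_hits(8))    # prints True: subblock 0 is untouched
print(line.local_read_hits(44))   # prints False: subblock 2 was invalidated
```

With a single valid bit for the whole line, the read at offset 8 would have missed as well, even though the remote write did not touch that word.&lt;br /&gt;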
&lt;br /&gt;
=== Optimizing the Layout of Shared Data in Cache Blocks ===&lt;br /&gt;
&lt;br /&gt;
A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, so the miss rate decreases. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus, the miss rate can go up or down as the block size increases in multiprocessors. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It depicts a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
Several data placement optimizations have been suggested to reduce the miss rate due to false sharing. They are listed below.&lt;br /&gt;
&lt;br /&gt;
* '''SplitScalar:''' '''''Place scalar variables that cause false sharing in different blocks.'''''&lt;br /&gt;
&lt;br /&gt;
* '''Heap Allocate:''' '''''Allocate shared space from different heap regions according to which processor requests the space.''''' It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Expand Record:''' '''''Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records.''''' While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
* ''' Align Record:''' '''''Choose a layout for arrays of records that minimizes the number of blocks the average record spans.''''' This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Lockscalar:''' '''''Place active scalars that are protected by a lock in the same block as the lock variable.''''' As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
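&lt;br /&gt;
The Expand Record optimization listed above can be sketched as a small layout calculation. The block size, record size, and helper names below are assumptions made for illustration; they are not taken from the referenced papers. Each record is padded up to a multiple of the block size, and the sketch counts how many cache blocks hold bytes from more than one record.&lt;br /&gt;

```python
# Sketch of the Expand Record idea: pad each record up to a multiple of the
# block size so that no two records share a cache block.
# BLOCK_SIZE, RECORD_SIZE and the helper names are illustrative assumptions.
BLOCK_SIZE = 64    # bytes per cache block (assumed)
RECORD_SIZE = 24   # bytes of real data per record (assumed)


def padded_size(record_size, block_size=BLOCK_SIZE):
    """Round record_size up to the next multiple of block_size."""
    blocks_needed = -(-record_size // block_size)   # ceiling division
    return blocks_needed * block_size


def shared_blocks(num_records, stride, record_size, block_size=BLOCK_SIZE):
    """Count cache blocks that hold bytes from more than one record."""
    users = {}
    for i in range(num_records):
        first = (i * stride) // block_size
        last = (i * stride + record_size - 1) // block_size
        for block in range(first, last + 1):
            users.setdefault(block, set()).add(i)
    return sum(1 for records in users.values() if len(records) != 1)


# Eight records packed back to back, then the same records after padding.
print(shared_blocks(8, RECORD_SIZE, RECORD_SIZE))               # prints 3
print(shared_blocks(8, padded_size(RECORD_SIZE), RECORD_SIZE))  # prints 0
```

The padding removes the shared blocks at the cost of memory: in this example each 24-byte record now occupies a full 64-byte block.&lt;br /&gt;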
&lt;br /&gt;
=== Compile Time Data Transformations ===&lt;br /&gt;
&lt;br /&gt;
This section introduces compile-time data transformations that reduce false sharing in multiprocessors. Please refer to the paper by Tor E. Jeremiassen and Susan J. Eggers for a detailed explanation of the algorithms.&lt;br /&gt;
&lt;br /&gt;
A series of compiler-directed algorithms and a suite of transformations can be employed to restructure the shared data at compile time. These algorithms analyze explicitly parallel programs, producing information about their cross-processor memory reference patterns that identifies data structures susceptible to false sharing, and then choose appropriate transformations to eliminate it.&lt;br /&gt;
&lt;br /&gt;
The compiler analysis comprises three stages:&lt;br /&gt;
&lt;br /&gt;
* Determine the section of code each process executes by computing its control flow graph (per-process control flow analysis).&lt;br /&gt;
* Perform non-concurrency analysis by examining the barrier synchronization pattern of the program, delineating the phases that cannot execute in parallel and computing the flow of control between them.&lt;br /&gt;
* Perform a summary side-effect analysis on a per-process basis for each phase (determined in stage two).&lt;br /&gt;
&lt;br /&gt;
The per-process control flow analysis and the summary side-effect analysis (stages 1 and 3, respectively) yield the sections of shared data that each processor reads and writes.&lt;br /&gt;
&lt;br /&gt;
The non-concurrency analysis (stage 2) uses synchronization points to determine which portions of the code can execute in parallel and which cannot.&lt;br /&gt;
&lt;br /&gt;
In order to reduce the number of false sharing misses, data must be restructured so that:&lt;br /&gt;
* Data that is only, or overwhelmingly, accessed by one processor is grouped together.&lt;br /&gt;
* Write shared data objects with no processor locality do not share cache lines.&lt;br /&gt;
&lt;br /&gt;
Two transformations have been devised to achieve the above two conditions.&lt;br /&gt;
* '''''group and transpose''''' – To address condition 1.&lt;br /&gt;
* '''''padding''''' – To address condition 2.&lt;br /&gt;
&lt;br /&gt;
'''''group and transpose''''' physically groups data together by changing the layout of the data structures in memory. If each processor’s data is smaller than the cache block size, it may be padded so that no two processors’ data share a cache block. In addition to avoiding false sharing, this also improves spatial locality.&lt;br /&gt;
&lt;br /&gt;
The second transformation, '''''padding''''', pads data that is falsely shared in the short term but eventually write-shared by all processors over time.&lt;br /&gt;
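&lt;br /&gt;
The grouping step can be illustrated with a small layout sketch. It is a hand-written stand-in, not the compiler algorithm from the Jeremiassen and Eggers paper; the block size of 8 words, the ownership pattern, and the function names are assumptions. Words are reordered so that each processor's words are contiguous, and the sketch counts how many cache blocks contain words owned by more than one processor.&lt;br /&gt;

```python
# Sketch of grouping data by owning processor (illustrative assumptions only).
WORDS_PER_BLOCK = 8   # words per cache block (assumed)


def group_by_owner(owner_of_word):
    """Return a layout (list of original word indices) grouped by owner.

    sorted() is stable, so each processor's words keep their relative order.
    """
    return sorted(range(len(owner_of_word)), key=lambda k: owner_of_word[k])


def blocks_with_multiple_owners(layout, owner_of_word,
                                words_per_block=WORDS_PER_BLOCK):
    """Count cache blocks whose words belong to more than one processor."""
    count = 0
    for start in range(0, len(layout), words_per_block):
        owners = {owner_of_word[k] for k in layout[start:start + words_per_block]}
        if len(owners) != 1:
            count += 1
    return count


# 64 words interleaved across 4 processors: word k belongs to processor k % 4.
owner_of_word = [k % 4 for k in range(64)]
original_layout = list(range(64))
grouped_layout = group_by_owner(owner_of_word)

print(blocks_with_multiple_owners(original_layout, owner_of_word))  # prints 8
print(blocks_with_multiple_owners(grouped_layout, owner_of_word))   # prints 0
```

In this example each processor owns 16 words, exactly two blocks, so no padding is needed after grouping. If a processor's data did not fill a whole number of blocks, the remainder would have to be padded as described above.&lt;br /&gt;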
&lt;br /&gt;
The speedups achieved with and without compile time data transformations for a few test programs are given below.&lt;br /&gt;
                                  [[Image:Plots.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Diminishing True sharing misses ==&lt;br /&gt;
A new technique called coherence decoupling breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data is verified against the coherent data. Thus, coherence decoupling can greatly reduce, if not eliminate, the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. The simulations and results can be viewed at the link below.&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.cs.utexas.edu/pub/dburger/papers/ASPLOS04_cd.pdf]&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;br /&gt;
&lt;br /&gt;
[http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/1663/ftp:zSzzSzftp.cs.washington.eduzSztrzSz1994zSz09zSzUW-CSE-94-09-05.pdf/jeremiassen94reducing.pdf]&lt;br /&gt;
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.&lt;br /&gt;
Tor E. Jeremiassen and Susan J. Eggers&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5879</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5879"/>
		<updated>2007-10-18T00:56:49Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also increases. A high cache miss rate is a cause for concern, since it can significantly limit the performance of multiprocessors. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses, and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by a processor is used by another processor, then it is said to be True Sharing. True sharing is intrinsic to a particular memory reference stream of a program and is not dependent on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False Sharing''': When independent data words for different processors are placed in the same block, then it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate, because each fetch brings in more neighboring words and so exploits spatial locality. However, long cache lines may cause false sharing, which occurs when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location, or to locations near it, can usually be satisfied from the cache until the system must restore coherence between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire line is invalidated even though the updates are independent of each other. Each update of an individual element invalidates the copies of the line held in other caches. Hence, other processors accessing a different element in the same line find the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element they access has not been modified. This happens because cache coherence is maintained on a cache-line basis, not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. This situation is called false sharing, and it can become a bottleneck for performance and scalability.&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
Several strategies have been proposed, and are still being developed, to decrease false sharing in multiprocessors. This article introduces a few of the salient ones.&lt;br /&gt;
&lt;br /&gt;
* Reducing False Sharing through Proper Block Sizing&lt;br /&gt;
* Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks&lt;br /&gt;
* Reducing False Sharing through Compile Time Data Transformations&lt;br /&gt;
* Reducing False Sharing through Sectored Caches&lt;br /&gt;
&lt;br /&gt;
=== Proper Block Sizing ===&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to a block of one fixed size for the entire program execution. Starting with each word in the memory space that is used, neighboring blocks are combined if, when combined, they produce less bus traffic than when left as separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When false sharing generates excessive traffic, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
=== Sectored Caches ===&lt;br /&gt;
&lt;br /&gt;
Sectored caches are another effective strategy for reducing false sharing misses in bus-based multiprocessors. A sectored cache is organized so that each line is divided into basic coherence units called subblocks. When false sharing occurs, the cache line involved need not be invalidated or transferred, as long as the corresponding subblocks are kept coherent. In simulations, this strategy reduced false sharing misses by about 30-80%. Details of the simulations are given in the paper linked below.&lt;br /&gt;
&lt;br /&gt;
[1] Sectored Caches &lt;br /&gt;
&lt;br /&gt;
=== Optimizing the Layout of Shared Data in Cache Blocks ===&lt;br /&gt;
&lt;br /&gt;
A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, so the miss rate decreases. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus, the miss rate can go up or down as the block size increases in multiprocessors. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It depicts a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
Several data placement optimizations have been suggested to reduce the miss rate due to false sharing. They are listed below.&lt;br /&gt;
&lt;br /&gt;
* '''SplitScalar:''' '''''Place scalar variables that cause false sharing in different blocks.'''''&lt;br /&gt;
&lt;br /&gt;
* '''Heap Allocate:''' '''''Allocate shared space from different heap regions according to which processor requests the space.''''' It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Expand Record:''' '''''Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records.''''' While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
* ''' Align Record:''' '''''Choose a layout for arrays of records that minimizes the number of blocks the average record spans.''''' This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Lockscalar:''' '''''Place active scalars that are protected by a lock in the same block as the lock variable.''''' As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
&lt;br /&gt;
=== Compile Time Data Transformations ===&lt;br /&gt;
&lt;br /&gt;
This section introduces compile-time data transformations that reduce false sharing in multiprocessors. Please refer to the paper by Tor E. Jeremiassen and Susan J. Eggers for a detailed explanation of the algorithms.&lt;br /&gt;
&lt;br /&gt;
A series of compiler-directed algorithms and a suite of transformations can be employed to restructure the shared data at compile time. These algorithms analyze explicitly parallel programs, producing information about their cross-processor memory reference patterns that identifies data structures susceptible to false sharing, and then choose appropriate transformations to eliminate it.&lt;br /&gt;
&lt;br /&gt;
The compiler analysis comprises three stages:&lt;br /&gt;
&lt;br /&gt;
* Determine the section of code each process executes by computing its control flow graph (per-process control flow analysis).&lt;br /&gt;
* Perform non-concurrency analysis by examining the barrier synchronization pattern of the program, delineating the phases that cannot execute in parallel and computing the flow of control between them.&lt;br /&gt;
* Perform a summary side-effect analysis on a per-process basis for each phase (determined in stage two).&lt;br /&gt;
&lt;br /&gt;
The per-process control flow analysis and the summary side-effect analysis (stages 1 and 3, respectively) yield the sections of shared data that each processor reads and writes.&lt;br /&gt;
&lt;br /&gt;
The non-concurrency analysis (stage 2) uses synchronization points to determine which portions of the code can execute in parallel and which cannot.&lt;br /&gt;
&lt;br /&gt;
In order to reduce the number of false sharing misses, data must be restructured so that:&lt;br /&gt;
* Data that is only, or overwhelmingly, accessed by one processor is grouped together.&lt;br /&gt;
* Write shared data objects with no processor locality do not share cache lines.&lt;br /&gt;
&lt;br /&gt;
Two transformations have been devised to achieve the above two conditions.&lt;br /&gt;
* '''''group and transpose''''' – To address condition 1.&lt;br /&gt;
* '''''padding''''' – To address condition 2.&lt;br /&gt;
&lt;br /&gt;
'''''group and transpose''''' physically groups data together by changing the layout of the data structures in memory. If each processor’s data is smaller than the cache block size, it may be padded so that no two processors’ data share a cache block. In addition to avoiding false sharing, this also improves spatial locality.&lt;br /&gt;
&lt;br /&gt;
The second transformation, '''''padding''''', pads data that is falsely shared in the short term but eventually write-shared by all processors over time.&lt;br /&gt;
&lt;br /&gt;
The speedups achieved with and without compile time data transformations for a few test programs are given below.&lt;br /&gt;
                                  [[Image:Plots.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Diminishing True sharing misses ==&lt;br /&gt;
A new technique called coherence decoupling breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data is verified against the coherent data. Thus, coherence decoupling can greatly reduce, if not eliminate, the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. The simulations and results can be viewed at the link below.&lt;br /&gt;
&lt;br /&gt;
[ftp://ftp.cs.utexas.edu/pub/dburger/papers/ASPLOS04_cd.pdf]&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;br /&gt;
&lt;br /&gt;
[http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/1663/ftp:zSzzSzftp.cs.washington.eduzSztrzSz1994zSz09zSzUW-CSE-94-09-05.pdf/jeremiassen94reducing.pdf]&lt;br /&gt;
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.&lt;br /&gt;
Tor E. Jeremiassen and Susan J. Eggers&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5878</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5878"/>
		<updated>2007-10-18T00:52:26Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also increases. A high cache miss rate is a cause for concern, since it can significantly limit the performance of multiprocessors. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses, and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by a processor is used by another processor, then it is said to be True Sharing. True sharing is intrinsic to a particular memory reference stream of a program and is not dependent on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False Sharing''': When independent data words for different processors are placed in the same block, then it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate, because each fetch brings in more neighboring words and so exploits spatial locality. However, long cache lines may cause false sharing, which occurs when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location, or to locations near it, can usually be satisfied from the cache until the system must restore coherence between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire line is invalidated even though the updates are independent of each other. Each update of an individual element invalidates the copies of the line held in other caches. Hence, other processors accessing a different element in the same line find the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element they access has not been modified. This happens because cache coherence is maintained on a cache-line basis, not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. This situation is called false sharing, and it can become a bottleneck for performance and scalability.&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat False Sharing'' ==&lt;br /&gt;
&lt;br /&gt;
Several strategies have been proposed, and are still being developed, to decrease false sharing in multiprocessors. This article introduces a few of the salient ones.&lt;br /&gt;
&lt;br /&gt;
* Reducing False Sharing through Proper Block Sizing&lt;br /&gt;
* Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks&lt;br /&gt;
* Reducing False Sharing through Compile Time Data Transformations&lt;br /&gt;
* Reducing False Sharing through Sectored Caches&lt;br /&gt;
&lt;br /&gt;
=== Reducing False Sharing through Proper Block Sizing ===&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to a block of one fixed size for the entire program execution. Starting with each word in the memory space that is used, neighboring blocks are combined if, when combined, they produce less bus traffic than when left as separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When false sharing generates excessive traffic, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
=== Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks ===&lt;br /&gt;
&lt;br /&gt;
A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, so the miss rate decreases. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus, the miss rate can go up or down as the block size increases in multiprocessors. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It depicts a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
Several data placement optimizations have been suggested to reduce the miss rate due to false sharing. They are listed below.&lt;br /&gt;
&lt;br /&gt;
* '''SplitScalar:''' '''''Place scalar variables that cause false sharing in different blocks.'''''&lt;br /&gt;
&lt;br /&gt;
* '''Heap Allocate:''' '''''Allocate shared space from different heap regions according to which processor requests the space.''''' It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Expand Record:''' '''''Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records.''''' While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
* ''' Align Record:''' '''''Choose a layout for arrays of records that minimizes the number of blocks the average record spans.''''' This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Lockscalar:''' '''''Place active scalars that are protected by a lock in the same block as the lock variable.''''' As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
&lt;br /&gt;
=== Reducing False Sharing through Compile Time Data Transformations ===&lt;br /&gt;
&lt;br /&gt;
This section introduces compile-time data transformations that reduce false sharing in multiprocessors. Please refer to the paper by Tor E. Jeremiassen and Susan J. Eggers for a detailed explanation of the algorithms.&lt;br /&gt;
&lt;br /&gt;
A series of compiler-directed algorithms and a suite of transformations can be employed to restructure the shared data at compile time. These algorithms analyze explicitly parallel programs, producing information about their cross-processor memory reference patterns that identifies data structures susceptible to false sharing, and then choose appropriate transformations to eliminate it.&lt;br /&gt;
&lt;br /&gt;
The compiler analysis comprises three stages:&lt;br /&gt;
&lt;br /&gt;
* Determine the section of code each process executes by computing its control flow graph (per-process control flow analysis).&lt;br /&gt;
* Perform non-concurrency analysis by examining the barrier synchronization pattern of the program, delineating the phases that cannot execute in parallel and computing the flow of control between them.&lt;br /&gt;
* Perform a summary side-effect analysis on a per-process basis for each phase (determined in stage two).&lt;br /&gt;
&lt;br /&gt;
The per-process control flow analysis and the summary side-effect analysis (stages 1 and 3, respectively) yield the sections of shared data that each processor reads and writes.&lt;br /&gt;
&lt;br /&gt;
The non-concurrency analysis (stage 2) uses synchronization points to determine which portions of the code can execute in parallel and which cannot.&lt;br /&gt;
&lt;br /&gt;
In order to reduce the number of false sharing misses, data must be restructured so that:&lt;br /&gt;
* Data that is only, or overwhelmingly, accessed by one processor is grouped together.&lt;br /&gt;
* Write-shared data objects with no processor locality do not share cache lines.&lt;br /&gt;
&lt;br /&gt;
Two transformations have been devised to achieve the above two conditions:&lt;br /&gt;
* '''''group and transpose''''' – To address condition 1.&lt;br /&gt;
* '''''padding''''' – To address condition 2.&lt;br /&gt;
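&lt;br /&gt;
As a rough illustration of how the analysis results might drive the choice between the two transformations, the Python sketch below classifies each shared variable by the set of processes that write it. The input format (`writes_by_phase`) and the function name are assumptions made for this example, not the representation used by Jeremiassen and Eggers. The sketch also ignores access frequency, so it cannot capture data that is only overwhelmingly accessed by one processor.&lt;br /&gt;

```python
def choose_transformations(writes_by_phase):
    """Map each shared variable to the name of a candidate transformation.

    `writes_by_phase` is a hypothetical summary of the side-effect analysis:
    a list with one entry per phase, where each entry maps a process id to
    the set of variable names that the process writes in that phase.
    """
    writers = {}
    for phase in writes_by_phase:
        for proc, variables in phase.items():
            for name in variables:
                writers.setdefault(name, set()).add(proc)

    plan = {}
    for name, procs in sorted(writers.items()):
        if len(procs) == 1:
            # Written by a single process: group it with that process's data.
            plan[name] = "group and transpose"
        else:
            # Written by several processes: keep it in a cache line of its own.
            plan[name] = "padding"
    return plan


# Hypothetical two-phase program with two processes.
phases = [{0: {"sum0", "flag"}, 1: {"sum1"}}, {1: {"flag", "sum1"}}]
plan = choose_transformations(phases)
```

Here `sum0` and `sum1` each have a single writer and are mapped to group and transpose, while `flag` is written by both processes and is mapped to padding.&lt;br /&gt;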
&lt;br /&gt;
'''''group and transpose''''' physically groups data together by changing the layout of the data structures in memory. If each processor’s data is smaller than the cache block size, it may be padded so that no two processors’ data share a cache block. In addition to avoiding false sharing, this also improves spatial locality.&lt;br /&gt;
&lt;br /&gt;
The second transformation, '''''padding''''', pads data that is falsely shared in the short term but is eventually write-shared by all processors over time.&lt;br /&gt;
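&lt;br /&gt;
The effect of group and transpose on block ownership can be sketched in Python. The word and block sizes, the function names, and the interleaved starting layout are all hypothetical. The sketch only counts how many cache blocks hold data belonging to more than one processor before and after the change of layout.&lt;br /&gt;

```python
WORD_BYTES = 8     # assumed word size
BLOCK_BYTES = 64   # assumed cache block size
WORDS_PER_BLOCK = BLOCK_BYTES // WORD_BYTES


def interleaved_index(proc, item, num_procs):
    """Original layout: element `item` of every processor is stored side by side."""
    return item * num_procs + proc


def grouped_index(proc, item, items_per_proc):
    """Group and transpose: each processor's elements are contiguous, and
    each group is padded up to a whole number of cache blocks."""
    blocks_per_proc = -(-items_per_proc // WORDS_PER_BLOCK)  # ceiling division
    stride = blocks_per_proc * WORDS_PER_BLOCK
    return proc * stride + item


def shared_blocks(index_of, num_procs, items_per_proc):
    """Count cache blocks that hold data of more than one processor."""
    owners = {}
    for proc in range(num_procs):
        for item in range(items_per_proc):
            block = index_of(proc, item) // WORDS_PER_BLOCK
            owners.setdefault(block, set()).add(proc)
    return sum(1 for procs in owners.values() if len(procs) != 1)


before = shared_blocks(lambda p, i: interleaved_index(p, i, 4), 4, 6)
after = shared_blocks(lambda p, i: grouped_index(p, i, 6), 4, 6)
```

With 4 processors and 6 words per processor, the interleaved layout leaves all 3 blocks shared, while the grouped and padded layout leaves none.&lt;br /&gt;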
&lt;br /&gt;
The speedups achieved with and without compile time data transformations for a few test programs are given below.&lt;br /&gt;
                                  [[Image:Plots.jpg]]&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Monica S. Lam, and John L. Hennessy&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;br /&gt;
&lt;br /&gt;
[http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/1663/ftp:zSzzSzftp.cs.washington.eduzSztrzSz1994zSz09zSzUW-CSE-94-09-05.pdf/jeremiassen94reducing.pdf]&lt;br /&gt;
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.&lt;br /&gt;
Tor E. Jeremiassen and Susan J. Eggers&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5862</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5862"/>
		<updated>2007-10-18T00:11:05Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses, and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': A data word produced by one processor is used by another processor. True sharing is intrinsic to the memory reference stream of a program and does not depend on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': Independent data words used by different processors are placed in the same block.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate by exploiting spatial locality, since more words are brought in with each line. However, long cache lines may cause false sharing, when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location, or to locations near it, can usually be satisfied from the cache until the system must write the line back to memory to keep the cache and memory coherent.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire line is invalidated in the other caches, even though the updates are independent of each other. Each update of an individual element marks the copies of the line held by other processors as invalid. Hence, other processors accessing a different element in the same line find the line invalid and are forced to fetch a fresh copy of the line from memory, even though the element they access has not been modified. This is because cache coherence is maintained on a cache-line basis, not for individual elements. The result is an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
&lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, so the miss rate decreases. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases in multiprocessors. The graph below shows the miss rate as a function of the block size for 16 and 32 processors. It shows a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
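&lt;br /&gt;
The opposing effects of block size can be reproduced with a small trace-driven model. The Python sketch below assumes infinite caches and a simple write-invalidate protocol, and it counts every miss, whether cold or caused by an invalidation. The trace format and the function name are made up for this illustration.&lt;br /&gt;

```python
def count_misses(accesses, words_per_block):
    """Replay (processor, word_address, is_write) tuples and count misses."""
    cached = set()  # (processor, block) pairs that currently hold a valid copy
    misses = 0
    for proc, addr, is_write in accesses:
        block = addr // words_per_block
        if (proc, block) not in cached:
            misses += 1
            cached.add((proc, block))
        if is_write:
            # Write-invalidate: drop every other processor's copy of the block.
            cached = {(p, b) for (p, b) in cached if b != block or p == proc}
    return misses


# One processor scans 16 consecutive words: larger blocks help.
scan = [(0, addr, False) for addr in range(16)]

# Two processors alternately write adjacent words: larger blocks hurt.
ping_pong = [(0, 0, True), (1, 1, True)] * 8
```

For the private scan, block sizes of 1, 4, and 16 words give 16, 4, and 1 misses. For the alternating writes, a 1-word block gives 2 misses, while a 4-word block makes all 16 accesses miss even though the two processors never touch the same word.&lt;br /&gt;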
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;br /&gt;
==== Proper Block Sizing ====&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to a block of a specific, fixed size for the entire program execution. Starting with each used word in the memory space as its own block, neighboring blocks are combined if doing so produces less bus traffic than leaving them as separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When excessive traffic is generated due to false sharing, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
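&lt;br /&gt;
The merging idea can be sketched as a greedy pass over the address space. This is a simplified illustration, not the published algorithm: the cost constants, the trace format, the infinite-cache write-invalidate model, and the single left-to-right pass are all assumptions, and blocks are not restricted to aligned power-of-two sizes.&lt;br /&gt;

```python
ADDRESS_COST = 4  # hypothetical bus cost of sending one address
WORD_COST = 1     # hypothetical bus cost of transferring one data word


def bus_traffic(trace, partition):
    """Replay (processor, word_address, is_write) tuples and total the cost
    of all block transfers. Words outside `partition` are 1-word blocks."""
    block_of = {w: (s, e) for (s, e) in partition for w in range(s, e)}
    cached = set()  # (processor, block) pairs that hold a valid copy
    traffic = 0
    for proc, addr, is_write in trace:
        block = block_of.get(addr, (addr, addr + 1))
        if (proc, block) not in cached:
            traffic += ADDRESS_COST + (block[1] - block[0]) * WORD_COST
            cached.add((proc, block))
        if is_write:
            cached = {(p, b) for (p, b) in cached if b != block or p == proc}
    return traffic


def choose_blocks(trace, num_words):
    """Grow the current block by one word whenever that lowers bus traffic."""
    blocks = [(0, 1)]
    for w in range(1, num_words):
        separate = blocks + [(w, w + 1)]
        merged = blocks[:-1] + [(blocks[-1][0], w + 1)]
        # min() keeps `separate` on a tie, so merging needs a strict gain.
        blocks = min([separate, merged], key=lambda p: bus_traffic(trace, p))
    return blocks


trace = [(0, addr, False) for addr in range(4)] + [(0, 4, True), (1, 5, True)] * 3
blocks = choose_blocks(trace, 6)
```

In this trace, processor 0 reads words 0 to 3 and repeatedly writes word 4, while processor 1 repeatedly writes word 5. The sketch merges words 0 to 4 into one block and leaves word 5 on its own, which isolates the falsely shared word.&lt;br /&gt;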
&lt;br /&gt;
==== Sectored Caches ====&lt;br /&gt;
Sectored caches can be another effective strategy for reducing false sharing misses in bus-based multiprocessors. A sectored cache is organized such that each line is divided into basic coherence units called subblocks. When false sharing occurs, the cache line involved need not be invalidated or transferred, as long as the corresponding subblocks are kept coherent.&lt;br /&gt;
Simulations of this strategy show a reduction in false sharing misses of about 30-80%. Details of the simulations are given in the paper linked below.&lt;br /&gt;
&lt;br /&gt;
[http://citeseer.ist.psu.edu/cache/papers/cs/5048/ftp:zSzzSzpads10.cs.nthu.edu.twzSzpubzSzpaperszSzkcliuzSzicpads97.pdf/liu97effectiveness.pdf]&lt;br /&gt;
Sectored Caches &lt;br /&gt;
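&lt;br /&gt;
A minimal Python sketch of subblock-level coherence is given here. It is not the simulator from the linked paper. It assumes infinite caches, a miss that refills every subblock of the line, and a write that invalidates only the written subblock in the other caches. The function name and trace format are hypothetical.&lt;br /&gt;

```python
def sectored_misses(accesses, words_per_line, words_per_subblock):
    """Count misses when coherence is kept per subblock instead of per line."""
    valid = set()  # (processor, subblock) pairs that hold a valid copy
    subblocks_per_line = words_per_line // words_per_subblock
    misses = 0
    for proc, addr, is_write in accesses:
        sub = addr // words_per_subblock
        if (proc, sub) not in valid:
            misses += 1
            # Refill all subblocks of the line that contains the address.
            first = (addr // words_per_line) * subblocks_per_line
            for s in range(first, first + subblocks_per_line):
                valid.add((proc, s))
        if is_write:
            # Invalidate only this subblock in the other caches.
            valid = {(p, s) for (p, s) in valid if s != sub or p == proc}
    return misses


# Two processors alternately write words 0 and 4 of the same 8-word line.
ping_pong = [(0, 0, True), (1, 4, True)] * 8
```

With 4-word subblocks the two writers only take their 2 initial misses, whereas treating the whole 8-word line as a single coherence unit makes all 16 writes miss.&lt;br /&gt;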
==== Data placement optimizations ====&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
(a) '''SplitScalar:''' Place scalar variables that cause false sharing in different blocks.&lt;br /&gt;
&lt;br /&gt;
(b) '''Heap Allocate:''' Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
(c) '''Expand Record:''' Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them shares the same cache block.&lt;br /&gt;
&lt;br /&gt;
(d) '''Align Record:''' Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
(e) '''LockScalar:''' Place active scalars that are protected by a lock in the same block as the lock variable. As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
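&lt;br /&gt;
The Expand Record idea can be illustrated with Python's `ctypes` module, which lays out structures like a C compiler. The record fields, the 64-byte block size, and the helper name are assumptions for this sketch. It also assumes the array starts on a block boundary, which `ctypes` does not guarantee.&lt;br /&gt;

```python
import ctypes

LINE_BYTES = 64  # assumed cache block size


class Record(ctypes.Structure):
    """Unpadded 16-byte record: four of them fit in one cache block."""
    _fields_ = [("count", ctypes.c_int64), ("total", ctypes.c_int64)]


class ExpandedRecord(ctypes.Structure):
    """Same record padded with dummy bytes to fill a whole cache block."""
    _fields_ = [
        ("count", ctypes.c_int64),
        ("total", ctypes.c_int64),
        ("dummy", ctypes.c_char * (LINE_BYTES - 16)),
    ]


def blocks_used(record_type, num_records):
    """Return the block index in which each array element starts."""
    size = ctypes.sizeof(record_type)
    return [(i * size) // LINE_BYTES for i in range(num_records)]


plain = blocks_used(Record, 4)
expanded = blocks_used(ExpandedRecord, 4)
```

The four unpadded records all start in block 0, so processors updating different records would contend for one block. The expanded records start in blocks 0, 1, 2, and 3, at the cost of four times the memory.&lt;br /&gt;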
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Monica S. Lam, and John L. Hennessy&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5857</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5857"/>
		<updated>2007-10-17T23:55:32Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses, and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': A data word produced by one processor is used by another processor. True sharing is intrinsic to the memory reference stream of a program and does not depend on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': Independent data words used by different processors are placed in the same block.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate by exploiting spatial locality, since more words are brought in with each line. However, long cache lines may cause false sharing, when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location, or to locations near it, can usually be satisfied from the cache until the system must write the line back to memory to keep the cache and memory coherent.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire line is invalidated in the other caches, even though the updates are independent of each other. Each update of an individual element marks the copies of the line held by other processors as invalid. Hence, other processors accessing a different element in the same line find the line invalid and are forced to fetch a fresh copy of the line from memory, even though the element they access has not been modified. This is because cache coherence is maintained on a cache-line basis, not for individual elements. The result is an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
&lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, so the miss rate decreases. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases in multiprocessors. The graph below shows the miss rate as a function of the block size for 16 and 32 processors. It shows a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;br /&gt;
==== Proper Block Sizing ====&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to a block of a specific, fixed size for the entire program execution. Starting with each used word in the memory space as its own block, neighboring blocks are combined if doing so produces less bus traffic than leaving them as separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When excessive traffic is generated due to false sharing, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Data placement optimizations ====&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
(a) '''SplitScalar:''' Place scalar variables that cause false sharing in different blocks.&lt;br /&gt;
&lt;br /&gt;
(b) '''Heap Allocate:''' Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
(c) '''Expand Record:''' Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them shares the same cache block.&lt;br /&gt;
&lt;br /&gt;
(d) '''Align Record:''' Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
(e) '''LockScalar:''' Place active scalars that are protected by a lock in the same block as the lock variable. As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Monica S. Lam, and John L. Hennessy&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=Strategies_to_combat_%E2%80%9CFalse_Sharing%E2%80%9D&amp;diff=5856</id>
		<title>Strategies to combat “False Sharing”</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Strategies_to_combat_%E2%80%9CFalse_Sharing%E2%80%9D&amp;diff=5856"/>
		<updated>2007-10-17T23:53:31Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Proper Block Sizing ==&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to a block of a specific, fixed size for the entire program execution. Starting with each used word in the memory space as its own block, neighboring blocks are combined if doing so produces less bus traffic than leaving them as separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When excessive traffic is generated due to false sharing, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Data placement optimizations ==&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
(a) '''SplitScalar:''' Place scalar variables that cause false sharing in different blocks.&lt;br /&gt;
&lt;br /&gt;
(b) '''Heap Allocate:''' Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
(c) '''Expand Record:''' Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them shares the same cache block.&lt;br /&gt;
&lt;br /&gt;
(d) '''Align Record:''' Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
(e) '''LockScalar:''' Place active scalars that are protected by a lock in the same block as the lock variable. As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=Strategies_to_combat_%E2%80%9CFalse_Sharing%E2%80%9D&amp;diff=5855</id>
		<title>Strategies to combat “False Sharing”</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Strategies_to_combat_%E2%80%9CFalse_Sharing%E2%80%9D&amp;diff=5855"/>
		<updated>2007-10-17T23:52:46Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== Proper Block Sizing ==&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to a block of a specific, fixed size for the entire program execution. Starting with each used word in the memory space as its own block, neighboring blocks are combined if doing so produces less bus traffic than leaving them as separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When excessive traffic is generated due to false sharing, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Data placement optimizations ==&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
(a) '''SplitScalar:''' Place scalar variables that cause false sharing in different blocks.&lt;br /&gt;
&lt;br /&gt;
(b) '''Heap Allocate:''' Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
(c) '''Expand Record:''' Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them shares the same cache block.&lt;br /&gt;
&lt;br /&gt;
(d) '''Align Record:''' Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
(e) '''LockScalar:''' Place active scalars that are protected by a lock in the same block as the lock variable. As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of the write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5854</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5854"/>
		<updated>2007-10-17T23:52:37Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses, and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': A data word produced by one processor is used by another processor. True sharing is intrinsic to the memory reference stream of a program and does not depend on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': Independent data words used by different processors are placed in the same block.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate by exploiting spatial locality, since more words are brought in with each line. However, long cache lines may cause false sharing, when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location, or to locations near it, can usually be satisfied from the cache until the system must write the line back to memory to keep the cache and memory coherent.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire line is invalidated in the other caches, even though the updates are independent of each other. Each update of an individual element marks the copies of the line held by other processors as invalid. Hence, other processors accessing a different element in the same line find the line invalid and are forced to fetch a fresh copy of the line from memory, even though the element they access has not been modified. This is because cache coherence is maintained on a cache-line basis, not for individual elements. The result is an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
&lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, so the miss rate decreases. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases in multiprocessors. The graph below shows the miss rate as a function of the block size for 16 and 32 processors. It shows a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== ''[[Strategies to combat “False Sharing”]]'' ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Monica S. Lam, and John L. Hennessy&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5846</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5846"/>
		<updated>2007-10-17T21:50:15Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses, and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': A data word produced by one processor is used by another processor. True sharing is intrinsic to the memory reference stream of a program and does not depend on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': Independent data words used by different processors are placed in the same block.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate by exploiting spatial locality, since more words are brought in with each line. However, long cache lines may cause false sharing, when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location, or to locations near it, can usually be satisfied from the cache until the system must write the line back to memory to keep the cache and memory coherent.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire line is invalidated in the other caches, even though the updates are independent of each other. Each update of an individual element marks the copies of the line held by other processors as invalid. Hence, other processors accessing a different element in the same line find the line invalid and are forced to fetch a fresh copy of the line from memory, even though the element they access has not been modified. This is because cache coherence is maintained on a cache-line basis, not for individual elements. The result is an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
&lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, so the miss rate decreases. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases in multiprocessors. The graph below shows the miss rate as a function of the block size for 16 and 32 processors. It shows a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
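&lt;br /&gt;
The invalidation behaviour described in this section can be sketched with a small simulation. The model below is a simplification and an assumption of this example: caches are infinite, the protocol is a plain write-invalidate scheme, and any miss on a line that the processor held earlier is counted as a coherence miss. The function name and the trace format are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: a simplified write-invalidate protocol with infinite caches.
def count_misses(trace, num_procs, line_size):
    """Replay (processor, op, address) accesses; op is "r" or "w".

    Returns (cold_misses, coherence_misses). With infinite caches, a miss
    on a line the processor held before can only be caused by invalidation.
    """
    valid = [set() for _ in range(num_procs)]  # lines currently valid
    seen = [set() for _ in range(num_procs)]   # lines ever fetched
    cold = coherence = 0
    for proc, op, address in trace:
        line = address // line_size
        if line not in valid[proc]:
            if line in seen[proc]:
                coherence += 1
            else:
                cold += 1
            valid[proc].add(line)
            seen[proc].add(line)
        if op == "w":
            # A write invalidates every other cached copy of the whole line.
            for other in range(num_procs):
                if other != proc:
                    valid[other].discard(line)
    return cold, coherence


# Two processors alternately update their own 8-byte word, ten rounds each.
trace = [(p, "w", p * 8) for _ in range(10) for p in range(2)]
print(count_misses(trace, num_procs=2, line_size=64))  # (2, 18)
print(count_misses(trace, num_procs=2, line_size=8))   # (2, 0)
```
&lt;br /&gt;
With 64-byte lines both words fall in one line, so every update after the first two misses. With 8-byte lines each word has its own line and only the two cold misses remain.&lt;br /&gt;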
&lt;br /&gt;
&lt;br /&gt;
''[[Strategies to combat “False Sharing”]]''&lt;br /&gt;
&lt;br /&gt;
== Proper Block Sizing ==&lt;br /&gt;
&lt;br /&gt;
An algorithm can select block sizes that minimize bus traffic by using variable (static) size blocks: the block size varies over the memory space of the program, but any given word is assigned to one fixed block size for the entire program execution. Starting with each used word in the memory space as its own block, neighboring blocks are combined if the combined block produces less bus traffic than the separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic, because fewer addresses are transmitted over the bus. When false sharing generates excessive traffic, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
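&lt;br /&gt;
The merging procedure can be sketched as a greedy search. This is only an illustration under stated assumptions, not the algorithm from the cited report: the traffic model (a fixed address cost plus one data transfer per word on every miss, no cost for invalidation messages), the infinite caches and all names below are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: greedy selection of variable (static) block sizes.
# Assumed traffic model: every miss costs one address transfer plus one
# data transfer per word in the fetched block.
ADDRESS_COST = 4  # hypothetical bus cycles per address transmission
WORD_COST = 1     # hypothetical bus cycles per data word


def bus_traffic(trace, blocks, num_procs):
    """Replay (processor, op, word) accesses under write-invalidate.

    blocks is a list of lists of word indices; each inner list is one block.
    """
    block_of = {w: i for i, words in enumerate(blocks) for w in words}
    valid = [set() for _ in range(num_procs)]
    traffic = 0
    for proc, op, word in trace:
        block = block_of[word]
        if block not in valid[proc]:
            traffic += ADDRESS_COST + WORD_COST * len(blocks[block])
            valid[proc].add(block)
        if op == "w":
            for other in range(num_procs):
                if other != proc:
                    valid[other].discard(block)
    return traffic


def choose_blocks(trace, num_words, num_procs):
    """Start from one-word blocks and merge neighbors while traffic drops."""
    blocks = [[w] for w in range(num_words)]
    while True:
        candidates = [blocks] + [
            blocks[:i] + [blocks[i] + blocks[i + 1]] + blocks[i + 2:]
            for i in range(len(blocks) - 1)
        ]
        # min() returns the first minimum, so ties keep the current layout.
        best = min(candidates, key=lambda b: bus_traffic(trace, b, num_procs))
        if best is blocks:
            return blocks
        blocks = best


# Processor 0 reads words 0 and 1, then processors 0 and 1 alternately
# write words 2 and 3 (independent data that would be falsely shared).
trace = [(0, "r", 0), (0, "r", 1)]
trace += [(p, "w", 2 + p) for _ in range(5) for p in range(2)]
print(choose_blocks(trace, num_words=4, num_procs=2))  # [[0, 1, 2], [3]]
```
&lt;br /&gt;
In this toy trace the words used only by processor 0 are grouped into one block, while word 3 stays isolated because combining it with word 2 would cause repeated invalidations.&lt;br /&gt;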
&lt;br /&gt;
&lt;br /&gt;
== Data placement optimizations ==&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
(a) '''SplitScalar:''' Place scalar variables that cause false sharing in different blocks.&lt;br /&gt;
&lt;br /&gt;
(b) '''Heap Allocate:''' Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
(c) '''Expand Record:''' Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them shares the same cache block.&lt;br /&gt;
&lt;br /&gt;
(d) '''Align Record:''' Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
(e) '''LockScalar:''' Place active scalars that are protected by a lock in the same block as the lock variable. As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better match the reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
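&lt;br /&gt;
The Expand Record optimization listed above can be illustrated with a little arithmetic. The sketch below pads a record to a whole number of blocks and counts how many adjacent records overlap in a block before and after padding. The 24-byte record, the 64-byte block and the function names are assumptions for the example.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of the Expand Record idea: pad each record to a whole number of blocks.
def padded_size(record_size, block_size):
    """Round record_size up to the next multiple of block_size (in bytes)."""
    return -(-record_size // block_size) * block_size


def blocks_touched(index, record_size, block_size):
    """Return the set of block indices spanned by record number index."""
    first = index * record_size // block_size
    last = ((index + 1) * record_size - 1) // block_size
    return set(range(first, last + 1))


def adjacent_pairs_sharing_a_block(count, record_size, block_size):
    """Count adjacent record pairs that have at least one block in common."""
    shared = 0
    for i in range(count - 1):
        here = blocks_touched(i, record_size, block_size)
        there = blocks_touched(i + 1, record_size, block_size)
        if not here.isdisjoint(there):
            shared += 1
    return shared


print(padded_size(24, 64))                        # 64
print(adjacent_pairs_sharing_a_block(8, 24, 64))  # 7
print(adjacent_pairs_sharing_a_block(8, 64, 64))  # 0
```
&lt;br /&gt;
Padding removes the overlap between records, at the cost of more memory and fewer useful words per fetched block.&lt;br /&gt;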
&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5845</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5845"/>
		<updated>2007-10-17T21:47:32Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate tends to increase as well. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and they are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': A data word produced by one processor is used by another processor. True sharing is intrinsic to the memory reference stream of a program and does not depend on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': Independent data words used by different processors are placed in the same block, so the processors share the block without sharing any word in it.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate, because each line then holds more neighboring words and exploits more spatial locality. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In a multiprocessor, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location, or to locations near it, can usually be satisfied from the cache until the system has to maintain coherence between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update different elements of the same cache line, each update invalidates the entire line in the other caches, even though the updates are independent of each other. Other processors that access a different element of the same line therefore find their copy marked invalid and are forced to fetch a fresh copy of the line, even though the element they access has not been modified. This happens because cache coherence is maintained per cache line, not per element. The result is an increase in interconnect traffic and overhead. In addition, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
 &lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. An important parameter that affects false sharing is the cache block size. In a uniprocessor, increasing the block size exploits more spatial locality, so the miss rate tends to decrease. In a multiprocessor, increasing the block size exploits more spatial locality but also raises the probability of false sharing, so the miss rate can go either up or down. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It shows a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
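&lt;br /&gt;
The opposing effects of block size can be reproduced on a toy workload. The sketch below is not the experiment behind the graph. It assumes infinite caches, a plain write-invalidate protocol, two processors that each scan a private array and update a private counter, and hypothetical addresses chosen so that the counters lie 16 bytes apart.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: total misses as a function of block size for a tiny two-processor
# workload. Addresses, sizes and the access pattern are hypothetical.
def total_misses(trace, num_procs, block_size):
    """Count misses under write-invalidate with infinite per-processor caches."""
    valid = [set() for _ in range(num_procs)]
    misses = 0
    for proc, op, address in trace:
        block = address // block_size
        if block not in valid[proc]:
            misses += 1
            valid[proc].add(block)
        if op == "w":
            # A write invalidates the block in every other cache.
            for other in range(num_procs):
                if other != proc:
                    valid[other].discard(block)
    return misses


ARRAY_BASE = [1024, 2048]  # private 64-word arrays, 4-byte words
COUNTER = [0, 16]          # per-processor counters, 16 bytes apart

trace = []
for i in range(64):
    for p in range(2):
        trace.append((p, "r", ARRAY_BASE[p] + 4 * i))  # sequential scan
        trace.append((p, "w", COUNTER[p]))             # private counter update

sweep = {size: total_misses(trace, 2, size) for size in (4, 8, 16, 32, 64)}
print(sweep)  # {4: 130, 8: 66, 16: 34, 32: 144, 64: 136}
```
&lt;br /&gt;
Up to 16-byte blocks the misses fall because each fetch brings in more of the scanned array. From 32 bytes on, both counters share a block and every counter update misses.&lt;br /&gt;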
&lt;br /&gt;
&lt;br /&gt;
''[[Strategies to combat “False Sharing”]]''&lt;br /&gt;
&lt;br /&gt;
== Proper Block Sizing ==&lt;br /&gt;
&lt;br /&gt;
An algorithm can select block sizes that minimize bus traffic by using variable (static) size blocks: the block size varies over the memory space of the program, but any given word is assigned to one fixed block size for the entire program execution. Starting with each used word in the memory space as its own block, neighboring blocks are combined if the combined block produces less bus traffic than the separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic, because fewer addresses are transmitted over the bus. When false sharing generates excessive traffic, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Data placement optimizations ==&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
(a) '''SplitScalar:''' Place scalar variables that cause false sharing in different blocks.&lt;br /&gt;
&lt;br /&gt;
(b) '''Heap Allocate:''' Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
(c) '''Expand Record:''' Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them shares the same cache block.&lt;br /&gt;
&lt;br /&gt;
(d) '''Align Record:''' Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
(e) '''LockScalar:''' Place active scalars that are protected by a lock in the same block as the lock variable. As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better match the reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
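&lt;br /&gt;
The Heap Allocate optimization listed above can be sketched as a per-processor arena allocator. The class below is a hypothetical model, not an implementation from the cited papers: the heap is treated as a flat range of byte addresses, and each processor owns a contiguous arena whose size is a multiple of the block size.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of the Heap Allocate idea: one block-aligned arena per processor.
class PerProcessorHeap:
    """Hand out addresses so that different processors never share a block.

    Hypothetical model: the heap is a flat byte range, each processor owns a
    contiguous arena of arena_size bytes, and arena_size is a multiple of
    block_size.
    """

    def __init__(self, num_procs, arena_size, block_size):
        if arena_size % block_size != 0:
            raise ValueError("arena_size must be a multiple of block_size")
        self.block_size = block_size
        self.arena_size = arena_size
        self.next_free = [p * arena_size for p in range(num_procs)]

    def allocate(self, proc, size):
        """Return the start address of size bytes in the arena of proc."""
        start = self.next_free[proc]
        arena_end = (proc + 1) * self.arena_size
        # Reject requests that would run past the end of this arena.
        if start + size not in range(start, arena_end + 1):
            raise MemoryError("arena exhausted")
        self.next_free[proc] = start + size
        return start


heap = PerProcessorHeap(num_procs=2, arena_size=256, block_size=64)
a = heap.allocate(0, 8)  # first request from processor 0
b = heap.allocate(1, 8)  # first request from processor 1
c = heap.allocate(0, 8)  # second request from processor 0
print(a, b, c)                 # 0 256 8
print(a // 64, b // 64, c // 64)  # 0 4 0
```
&lt;br /&gt;
A single shared bump allocator would have returned addresses 0, 8 and 16 for the same three requests, which places data of both processors in block 0.&lt;br /&gt;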
&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5844</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5844"/>
		<updated>2007-10-17T21:46:54Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate tends to increase as well. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and they are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': A data word produced by one processor is used by another processor. True sharing is intrinsic to the memory reference stream of a program and does not depend on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': Independent data words used by different processors are placed in the same block, so the processors share the block without sharing any word in it.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate, because each line then holds more neighboring words and exploits more spatial locality. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In a multiprocessor, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location, or to locations near it, can usually be satisfied from the cache until the system has to maintain coherence between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update different elements of the same cache line, each update invalidates the entire line in the other caches, even though the updates are independent of each other. Other processors that access a different element of the same line therefore find their copy marked invalid and are forced to fetch a fresh copy of the line, even though the element they access has not been modified. This happens because cache coherence is maintained per cache line, not per element. The result is an increase in interconnect traffic and overhead. In addition, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
 &lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. An important parameter that affects false sharing is the cache block size. In a uniprocessor, increasing the block size exploits more spatial locality, so the miss rate tends to decrease. In a multiprocessor, increasing the block size exploits more spatial locality but also raises the probability of false sharing, so the miss rate can go either up or down. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It shows a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''[[Strategies to combat “False Sharing”]]''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5843</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5843"/>
		<updated>2007-10-17T21:46:26Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate tends to increase as well. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and they are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': A data word produced by one processor is used by another processor. True sharing is intrinsic to the memory reference stream of a program and does not depend on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': Independent data words used by different processors are placed in the same block, so the processors share the block without sharing any word in it.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate, because each line then holds more neighboring words and exploits more spatial locality. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In a multiprocessor, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location, or to locations near it, can usually be satisfied from the cache until the system has to maintain coherence between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update different elements of the same cache line, each update invalidates the entire line in the other caches, even though the updates are independent of each other. Other processors that access a different element of the same line therefore find their copy marked invalid and are forced to fetch a fresh copy of the line, even though the element they access has not been modified. This happens because cache coherence is maintained per cache line, not per element. The result is an increase in interconnect traffic and overhead. In addition, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
 &lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. An important parameter that affects false sharing is the cache block size. In a uniprocessor, increasing the block size exploits more spatial locality, so the miss rate tends to decrease. In a multiprocessor, increasing the block size exploits more spatial locality but also raises the probability of false sharing, so the miss rate can go either up or down. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It shows a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== ''[[Strategies to combat “False Sharing”]]'' ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5842</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5842"/>
		<updated>2007-10-17T21:46:01Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate tends to increase as well. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and they are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': A data word produced by one processor is used by another processor. True sharing is intrinsic to the memory reference stream of a program and does not depend on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': Independent data words used by different processors are placed in the same block, so the processors share the block without sharing any word in it.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate, because each line then holds more neighboring words and exploits more spatial locality. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In a multiprocessor, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location, or to locations near it, can usually be satisfied from the cache until the system has to maintain coherence between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update different elements of the same cache line, each update invalidates the entire line in the other caches, even though the updates are independent of each other. Other processors that access a different element of the same line therefore find their copy marked invalid and are forced to fetch a fresh copy of the line, even though the element they access has not been modified. This happens because cache coherence is maintained per cache line, not per element. The result is an increase in interconnect traffic and overhead. In addition, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
 &lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. An important parameter that affects false sharing is the cache block size. In a uniprocessor, increasing the block size exploits more spatial locality, so the miss rate tends to decrease. In a multiprocessor, increasing the block size exploits more spatial locality but also raises the probability of false sharing, so the miss rate can go either up or down. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It shows a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5840</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5840"/>
		<updated>2007-10-17T21:42:39Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate tends to increase as well. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and they are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': A data word produced by one processor is used by another processor. True sharing is intrinsic to the memory reference stream of a program and does not depend on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': Independent data words used by different processors are placed in the same block, so the processors share the block without sharing any word in it.&lt;br /&gt;
&lt;br /&gt;
Increasing the cache line size helps reduce the miss rate, because each line then holds more neighboring words and exploits more spatial locality. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In a multiprocessor, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location, or to locations near it, can usually be satisfied from the cache until the system has to maintain coherence between cache and memory and write the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update different elements of the same cache line, each update invalidates the entire line in the other caches, even though the updates are independent of each other. Other processors that access a different element of the same line therefore find their copy marked invalid and are forced to fetch a fresh copy of the line, even though the element they access has not been modified. This happens because cache coherence is maintained per cache line, not per element. The result is an increase in interconnect traffic and overhead. In addition, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
 &lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. An important parameter that affects false sharing is the cache block size. In a uniprocessor, increasing the block size exploits more spatial locality, so the miss rate tends to decrease. In a multiprocessor, increasing the block size exploits more spatial locality but also raises the probability of false sharing, so the miss rate can go either up or down. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It shows a significant increase in the miss rate as the block size grows, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Proper Block Sizing ==&lt;br /&gt;
&lt;br /&gt;
An algorithm can select block sizes that minimize bus traffic by using variable (static) size blocks: the block size varies over the memory space of the program, but any given word is assigned to one fixed block size for the entire program execution. Starting with each used word in the memory space as its own block, neighboring blocks are combined if the combined block produces less bus traffic than the separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic, because fewer addresses are transmitted over the bus. When false sharing generates excessive traffic, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Data placement optimizations ==&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
(a) '''SplitScalar:''' Place scalar variables that cause false sharing in different blocks.&lt;br /&gt;
&lt;br /&gt;
(b) '''Heap Allocate:''' Allocate shared space from different heap regions according to which processor request the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
(c) '''Expand Record:''' Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
(d)''' Align Record:''' Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
(e) '''Lockscalar:''' Place active scalars that are protected by a lock in the same block as the lock variable. As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;br /&gt;
&lt;br /&gt;
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]&lt;br /&gt;
Analysis of Shared Memory Misses and Reference Patterns&lt;br /&gt;
Jeffrey B. Rothman and Alan Jay Smith&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5838</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5838"/>
		<updated>2007-10-17T21:41:35Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by a processor is used by another processor, then it is said to be True Sharing. True sharing is intrinsic to a particular memory reference stream of a program and is not dependent on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False Sharing''': When independent data words for different processors are placed in the same block, then it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of a cache line helps reduce the miss rate, as more neighboring words are brought in with each line and spatial locality is exploited. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In a multiprocessor, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to locations near it can usually be satisfied from the cache, until the system needs to restore coherence between cache and memory and writes the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire cache line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the copies of the line held in other caches as invalid. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element accessed has not been modified. This is because cache coherence is maintained on a cache-line basis, and not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
&lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. A very important parameter that affects false sharing is the cache block size. In uniprocessors, a larger block size exploits more spatial locality, so the miss rate tends to decrease. In multiprocessors, however, a larger block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It depicts a significant increase in the miss rate as the block size increases, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
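&lt;br /&gt;
The block-size effect described above can be illustrated with a toy model. The Python sketch below is not taken from the referenced paper; it assumes a write-invalidate protocol in which only the last writer holds a valid copy of a block, and the function name and parameters are hypothetical. Each processor repeatedly writes its own private word, so no data is truly shared.&lt;br /&gt;
&lt;br /&gt;
```python
def count_coherence_misses(num_procs, block_words, rounds):
    """Count misses when processor p repeatedly writes its private word p.

    Assumed model: a write gives the writer the only valid copy of the
    whole block, so any other processor misses on its next access to it.
    """
    owner = {}    # block number -> processor holding the only valid copy
    misses = 0
    for _ in range(rounds):
        for p in range(num_procs):
            block = p // block_words      # word p lives in this block
            if owner.get(block) != p:     # no valid copy: the line must be fetched
                misses += 1
            owner[block] = p              # the write invalidates all other copies
    return misses


for words_per_block in (1, 2, 4):
    print(words_per_block, count_coherence_misses(4, words_per_block, 100))
```
&lt;br /&gt;
With 4 processors and 100 rounds, this model gives 4 misses when each word has its own block, and 400 misses when 2 or 4 words share a block, because every write then finds the line invalidated by a neighbor.&lt;br /&gt;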
&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Proper Block Sizing ==&lt;br /&gt;
&lt;br /&gt;
An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; i.e., the block size varies over the memory space of the program, but any given word is assigned to a block of one fixed size for the entire program execution. Starting with each used word in the memory space as its own block, neighboring blocks are combined if the combined block produces less bus traffic than the separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When excessive traffic is generated due to false sharing, the problem blocks are isolated by not combining them into larger units.&lt;br /&gt;
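&lt;br /&gt;
The greedy merging step described above might be sketched as follows. This is only an illustration under stated assumptions: the trace format (a list of (processor, word) pairs), the cost of one address transmission, and the single-owner write-invalidate model in which every access is treated as a write are all hypothetical choices, not details of the original algorithm.&lt;br /&gt;
&lt;br /&gt;
```python
ADDRESS_COST = 1  # assumed cost of sending one address, in word-transfer units


def block_traffic(trace, start, end):
    """Bus traffic for accesses to words start..end-1 treated as one block."""
    size = end - start
    owner = None      # processor holding the only valid copy (assumed model)
    traffic = 0
    for proc, word in trace:
        if word not in range(start, end):
            continue
        if owner != proc:                     # miss: one address plus the whole block
            traffic += ADDRESS_COST + size
            owner = proc
    return traffic


def choose_blocks(trace, num_words):
    """Start with one-word blocks and merge neighbors while that reduces traffic."""
    blocks = [(w, w + 1) for w in range(num_words)]
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks) - 1):
            left, right = blocks[i], blocks[i + 1]
            separate = block_traffic(trace, *left) + block_traffic(trace, *right)
            combined = block_traffic(trace, left[0], right[1])
            if min(combined, separate) != separate:   # merging is strictly cheaper
                blocks[i:i + 2] = [(left[0], right[1])]
                merged = True
                break
    return blocks


# Words 0 and 1 are used by processor 0; words 2 and 3 by processors 1 and 2.
trace = [(0, 0), (0, 1), (1, 2), (2, 3)] * 10
print(choose_blocks(trace, 4))
```
&lt;br /&gt;
In this trace, words 0 and 1 are grouped because one processor uses both, while words 2 and 3 stay in separate blocks because combining them would cause false sharing between processors 1 and 2.&lt;br /&gt;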
&lt;br /&gt;
&lt;br /&gt;
== Data placement optimizations ==&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
(a) '''SplitScalar:''' Place scalar variables that cause false sharing in different blocks.&lt;br /&gt;
&lt;br /&gt;
(b) '''Heap Allocate:''' Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
(c) '''Expand Record:''' Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
(d)''' Align Record:''' Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
(e) '''Lockscalar:''' Place active scalars that are protected by a lock in the same block as the lock variable. As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
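&lt;br /&gt;
The Expand Record idea can be checked with a small layout calculation. The sketch below assumes a 64-byte cache block, 24-byte records and an array that starts on a block boundary; these values and the helper names are chosen only for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
BLOCK_BYTES = 64    # assumed cache block size
RECORD_BYTES = 24   # assumed size of one record


def padded_size(record_bytes, block_bytes):
    """Round the record size up to a whole number of cache blocks."""
    blocks_needed = -(-record_bytes // block_bytes)   # ceiling division
    return blocks_needed * block_bytes


def blocks_touched(index, stride, record_bytes, block_bytes):
    """Block numbers occupied by record `index` in an array with this stride."""
    first = (index * stride) // block_bytes
    last = (index * stride + record_bytes - 1) // block_bytes
    return set(range(first, last + 1))


def shared_blocks(num_records, stride, record_bytes, block_bytes):
    """Count cache blocks that hold bytes of more than one record."""
    users = {}
    for i in range(num_records):
        for b in blocks_touched(i, stride, record_bytes, block_bytes):
            users.setdefault(b, set()).add(i)
    return sum(1 for recs in users.values() if len(recs) != 1)


unpadded = shared_blocks(8, RECORD_BYTES, RECORD_BYTES, BLOCK_BYTES)
padded = shared_blocks(8, padded_size(RECORD_BYTES, BLOCK_BYTES), RECORD_BYTES, BLOCK_BYTES)
print(unpadded, padded)
```
&lt;br /&gt;
For 8 records, the unpadded layout (stride 24) leaves 3 blocks shared between records, while padding each record to 64 bytes leaves none, at the cost of using 512 bytes instead of 192.&lt;br /&gt;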
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5835</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5835"/>
		<updated>2007-10-17T21:19:15Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by a processor is used by another processor, then it is said to be True Sharing. True sharing is intrinsic to a particular memory reference stream of a program and is not dependent on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False Sharing''': When independent data words for different processors are placed in the same block, then it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of a cache line helps reduce the miss rate, as more neighboring words are brought in with each line and spatial locality is exploited. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In a multiprocessor, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to locations near it can usually be satisfied from the cache, until the system needs to restore coherence between cache and memory and writes the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire cache line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the copies of the line held in other caches as invalid. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element accessed has not been modified. This is because cache coherence is maintained on a cache-line basis, and not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
&lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. A very important parameter that affects false sharing is the cache block size. In uniprocessors, a larger block size exploits more spatial locality, so the miss rate tends to decrease. In multiprocessors, however, a larger block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It depicts a significant increase in the miss rate as the block size increases, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;br /&gt;
&lt;br /&gt;
*  Data placement optimizations &lt;br /&gt;
&lt;br /&gt;
(a) '''SplitScalar:''' Place scalar variables that cause false sharing in different blocks.&lt;br /&gt;
&lt;br /&gt;
(b) '''Heap Allocate:''' Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
(c) '''Expand Record:''' Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
(d)''' Align Record:''' Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
(e) '''Lockscalar:''' Place active scalars that are protected by a lock in the same block as the lock variable. As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5833</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5833"/>
		<updated>2007-10-17T21:18:36Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by a processor is used by another processor, then it is said to be True Sharing. True sharing is intrinsic to a particular memory reference stream of a program and is not dependent on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False Sharing''': When independent data words for different processors are placed in the same block, then it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of a cache line helps reduce the miss rate, as more neighboring words are brought in with each line and spatial locality is exploited. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In a multiprocessor, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to locations near it can usually be satisfied from the cache, until the system needs to restore coherence between cache and memory and writes the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire cache line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the copies of the line held in other caches as invalid. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element accessed has not been modified. This is because cache coherence is maintained on a cache-line basis, and not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
&lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. A very important parameter that affects false sharing is the cache block size. In uniprocessors, a larger block size exploits more spatial locality, so the miss rate tends to decrease. In multiprocessors, however, a larger block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It depicts a significant increase in the miss rate as the block size increases, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;br /&gt;
&lt;br /&gt;
*  Data placement optimizations &lt;br /&gt;
&lt;br /&gt;
(a) '''SplitScalar:''' Place scalar variables that cause false sharing in different blocks.&lt;br /&gt;
&lt;br /&gt;
(b) '''Heap Allocate:''' Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
&lt;br /&gt;
(c) '''Expand Record:''' Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
&lt;br /&gt;
(d)''' Align Record:''' Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
&lt;br /&gt;
(e) '''Lockscalar:''' Place active scalars that are protected by a lock in the same block as the lock variable. As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
False Sharing and Spatial Locality in Multiprocessor Caches&lt;br /&gt;
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE [http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5832</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5832"/>
		<updated>2007-10-17T21:16:27Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by a processor is used by another processor, then it is said to be True Sharing. True sharing is intrinsic to a particular memory reference stream of a program and is not dependent on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False Sharing''': When independent data words for different processors are placed in the same block, then it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of a cache line helps reduce the miss rate, as more neighboring words are brought in with each line and spatial locality is exploited. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In a multiprocessor, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to locations near it can usually be satisfied from the cache, until the system needs to restore coherence between cache and memory and writes the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire cache line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the copies of the line held in other caches as invalid. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element accessed has not been modified. This is because cache coherence is maintained on a cache-line basis, and not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
&lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. A very important parameter that affects false sharing is the cache block size. In uniprocessors, a larger block size exploits more spatial locality, so the miss rate tends to decrease. In multiprocessors, however, a larger block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It depicts a significant increase in the miss rate as the block size increases, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;br /&gt;
&lt;br /&gt;
*  Data placement optimizations &lt;br /&gt;
&lt;br /&gt;
(a) SplitScalar: Place scalar variables that cause false sharing in different blocks.&lt;br /&gt;
(b) HeapAllocate: Allocate shared space from different heap regions according to which processor requests the space. It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.&lt;br /&gt;
(c) Expand Record: Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records. While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.&lt;br /&gt;
(d) Align Record: Choose a layout for arrays of records that minimizes the number of blocks the average record spans. This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.&lt;br /&gt;
(e) Lockscalar: Place active scalars that are protected by a lock in the same block as the lock variable. As a result, the scalar is prefetched when the lock is accessed.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
False sharing is caused by a mismatch between the memory layout of write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.&lt;br /&gt;
&lt;br /&gt;
==''References''==&lt;br /&gt;
&lt;br /&gt;
[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5828</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5828"/>
		<updated>2007-10-17T21:04:44Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by a processor is used by another processor, then it is said to be True Sharing. True sharing is intrinsic to a particular memory reference stream of a program and is not dependent on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False Sharing''': When independent data words for different processors are placed in the same block, then it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of a cache line helps reduce the miss rate, as more neighboring words are brought in with each line and spatial locality is exploited. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In a multiprocessor, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to locations near it can usually be satisfied from the cache, until the system needs to restore coherence between cache and memory and writes the cache line back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire cache line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the copies of the line held in other caches as invalid. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element accessed has not been modified. This is because cache coherence is maintained on a cache-line basis, and not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
&lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. A very important parameter that affects false sharing is the cache block size. In uniprocessors, a larger block size exploits more spatial locality, so the miss rate tends to decrease. In multiprocessors, however, a larger block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases. The graph below shows the miss rate as a function of block size for 16 and 32 processors. It depicts a significant increase in the miss rate as the block size increases, due to false sharing.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5826</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5826"/>
		<updated>2007-10-17T20:59:20Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by a processor is used by another processor, then it is said to be True Sharing. True sharing is intrinsic to a particular memory reference stream of a program and is not dependent on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False Sharing''': When independent data words for different processors are placed in the same block, then it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of the line in the cache helps in reducing the hit time, as more blocks can be accommodated in the same line. However, long cache lines may cause false sharing, when different processors access different words in the same cache line. In essence, they share the same line, without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to nearby locations can usually be satisfied from the cache until the system determines that, to maintain coherency between cache and memory, the cache line must be written back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire cache line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the copies of the line held in other caches as invalid. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
 &lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size exploits more spatial locality, so the miss rate tends to decrease. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also raises the probability of false sharing. Thus the miss rate can go either up or down as the block size grows in multiprocessors.&lt;br /&gt;
&lt;br /&gt;
                                         [[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Miss_rate_vs_block_size.jpg&amp;diff=5823</id>
		<title>File:Miss rate vs block size.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Miss_rate_vs_block_size.jpg&amp;diff=5823"/>
		<updated>2007-10-17T20:56:29Z</updated>

		<summary type="html">&lt;p&gt;Kperi: Cache misses on the shared data as a function of the block size&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Cache misses on the shared data as a function of the block size&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5822</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5822"/>
		<updated>2007-10-17T20:55:02Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate tends to increase as well. A high cache miss rate is a cause for concern, since it can significantly limit the performance of multiprocessors. Cache misses are broadly categorized into the “three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by one processor is used by another processor, it is called true sharing. True sharing is intrinsic to the memory reference stream of a program and does not depend on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': When independent data words used by different processors are placed in the same block, it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of a cache line helps reduce the miss rate, as more neighboring words are brought in with each line. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to nearby locations can usually be satisfied from the cache until the system determines that, to maintain coherency between cache and memory, the cache line must be written back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire cache line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the copies of the line held in other caches as invalid. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
 &lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size exploits more spatial locality, so the miss rate tends to decrease. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also raises the probability of false sharing. Thus the miss rate can go either up or down as the block size grows in multiprocessors.&lt;br /&gt;
&lt;br /&gt;
[[Image:miss rate vs block size.jpg]]&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5819</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5819"/>
		<updated>2007-10-17T20:52:22Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate tends to increase as well. A high cache miss rate is a cause for concern, since it can significantly limit the performance of multiprocessors. Cache misses are broadly categorized into the “three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by one processor is used by another processor, it is called true sharing. True sharing is intrinsic to the memory reference stream of a program and does not depend on the block size.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': When independent data words used by different processors are placed in the same block, it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of a cache line helps reduce the miss rate, as more neighboring words are brought in with each line. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to nearby locations can usually be satisfied from the cache until the system determines that, to maintain coherency between cache and memory, the cache line must be written back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire cache line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the copies of the line held in other caches as invalid. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
 &lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability. A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size exploits more spatial locality, so the miss rate tends to decrease. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also raises the probability of false sharing. Thus the miss rate can go either up or down as the block size grows in multiprocessors.&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5816</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5816"/>
		<updated>2007-10-17T20:41:08Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. Cache misses are broadly categorized into the “three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by one processor is used by another processor, it is called true sharing.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': When independent data words used by different processors are placed in the same block, it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of a cache line helps reduce the miss rate, as more neighboring words are brought in with each line. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
==''Problem with False Sharing'' ==&lt;br /&gt;
In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to nearby locations can usually be satisfied from the cache until the system determines that, to maintain coherency between cache and memory, the cache line must be written back to memory.&lt;br /&gt;
&lt;br /&gt;
However, when multiple processors update individual elements in the same cache line, the entire cache line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the copies of the line held in other caches as invalid. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.&lt;br /&gt;
 &lt;br /&gt;
This situation is called false sharing, and it can become a bottleneck for performance and scalability.&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5811</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5811"/>
		<updated>2007-10-17T20:25:05Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ''Introduction'' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. Cache misses are broadly categorized into the “three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by one processor is used by another processor, it is called true sharing.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': When independent data words used by different processors are placed in the same block, it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of a cache line helps reduce the miss rate, as more neighboring words are brought in with each line. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;br /&gt;
&lt;br /&gt;
== ''Strategies to combat “False Sharing”'' ==&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5808</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5808"/>
		<updated>2007-10-17T20:14:50Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== '''''Introduction''''' ==&lt;br /&gt;
&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. Cache misses are broadly categorized into the “three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by one processor is used by another processor, it is called true sharing.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': When independent data words used by different processors are placed in the same block, it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of a cache line helps reduce the miss rate, as more neighboring words are brought in with each line. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5807</id>
		<title>CSC/ECE 506 Fall 2007/wiki3 1 satkar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Fall_2007/wiki3_1_satkar&amp;diff=5807"/>
		<updated>2007-10-17T20:14:28Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''''Introduction'''''&lt;br /&gt;
----&lt;br /&gt;
Cache organization plays a key role in modern computers, especially in multiprocessors. Cache misses are broadly categorized into the “three Cs”: compulsory misses, capacity misses and conflict misses. Cache-coherent multiprocessors introduce yet another category, called coherence misses. These occur when blocks of data are shared among multiple caches, and are of two types:&lt;br /&gt;
&lt;br /&gt;
•	'''True sharing''': When a data word produced by one processor is used by another processor, it is called true sharing.&lt;br /&gt;
&lt;br /&gt;
•	'''False sharing''': When independent data words used by different processors are placed in the same block, it is called false sharing.&lt;br /&gt;
&lt;br /&gt;
Increasing the size of a cache line helps reduce the miss rate, as more neighboring words are brought in with each line. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, they share the same line without truly sharing the accessed data.&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=2.Blade_Servers&amp;diff=3391</id>
		<title>2.Blade Servers</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=2.Blade_Servers&amp;diff=3391"/>
		<updated>2007-09-11T01:12:27Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Blade servers are a revolutionary new concept for enterprise applications currently using a “stack of PC servers” approach. Blade servers promise to greatly increase compute density, reduce cost, improve reliability, and simplify cabling. Companies such as Dell, Hewlett Packard, IBM, RLX, and Sun offer blade server solutions that reduce operating expense while increasing services density. Blade servers form the basis for a modular computing paradigm.&lt;br /&gt;
&lt;br /&gt;
== Evolution ==&lt;br /&gt;
For many years, traditional standalone servers grew larger and faster, taking on more and more tasks as networked computing expanded. New servers were added to data centers as the need arose, often as a quick fix with little coordination or planning; it was not unusual for data center operators to discover that servers had been added without their knowledge. The resulting complexity of boxes and cabling became a growing invitation to confusion, mistakes, and inflexibility.&lt;br /&gt;
&lt;br /&gt;
                                                [[Image:Conventional Servers.jpg]]&lt;br /&gt;
                                                    Figure : Conventional Servers&lt;br /&gt;
Blade servers, first appearing in 2001, are a very simple and pure example of modular architecture – the blades in a blade server chassis are physically identical, with identical processors, ready to be configured and used for any purpose desired by the user. Their introduction brought many benefits of modularity to the server landscape – scalability, ease of duplication, specialization of function, and adaptability. Blade servers were developed in response to a critical and growing need in the data center: the requirement to increase server performance and availability without dramatically increasing the size, cost and management complexity of an ever-growing data center. To keep up with user demand, and because of the space and power demands of traditional tower and rackmount servers, data centers are being forced to expand their physical plant at an alarming rate.&lt;br /&gt;
&lt;br /&gt;
                          [[Image:Blade Server.jpg]]&lt;br /&gt;
&lt;br /&gt;
But while these classic modular advantages have given blade servers a growing presence in data centers, their full potential awaits the widespread implementation of one remaining critical capability of modular design: fault tolerance. Fault tolerant blade servers – ones with built-in “failover” logic to transfer operation from failed to healthy blades – have only recently started to become available and affordable. The reliability of such fault tolerant servers will surpass that of current techniques involving redundant software and clusters of single servers, putting blade servers in a position to become the dominant server architecture of data centers. With the emergence of automated fault tolerance, industry observers predict rapid migration to blade servers over the forthcoming years.&lt;br /&gt;
&lt;br /&gt;
== General blade server architecture ==&lt;br /&gt;
&lt;br /&gt;
A general blade server architecture is shown in the figure. The hardware components of a blade server are the switch blade, chassis (with fans, temperature sensors, etc), and multiple compute blades. Some vendors offer, partner, or plan to partner with companies that provide application specific blades that provide traffic&lt;br /&gt;
conditioning, protection, or network processing prior to the traffic reaching the compute blades. Often, these application specific&lt;br /&gt;
blades may be functionally positioned between the switch blade and compute blades. However, these blades reside in a standard&lt;br /&gt;
compute blade slot.&lt;br /&gt;
&lt;br /&gt;
                                  &lt;br /&gt;
The outside world connects through the rear of the chassis to a switch card in the blade server. The switch card is provisioned to&lt;br /&gt;
distribute packets to blades within the blade server. All these components are wrapped together with network management system&lt;br /&gt;
software provided by the blade server vendor. The specifics on the blade server architecture vary from vendor to vendor. But before&lt;br /&gt;
you discount this as a bunch of proprietary architectures, think again. Remember that IBM and others dramatically advanced and&lt;br /&gt;
proliferated the PC architecture, changing the face of computing forever. &lt;br /&gt;
 &lt;br /&gt;
The blade server industry appears to be headed in the same direction. There are some areas where standardization of blade&lt;br /&gt;
server components will prove helpful. However, blade server vendors' ability to quickly adapt and advance their architectures to&lt;br /&gt;
suit specific applications, unencumbered by the standards process, is likely to accelerate proliferation in the near term.&lt;br /&gt;
&lt;br /&gt;
== Blade Enclosure ==&lt;br /&gt;
&lt;br /&gt;
The enclosure (or chassis) performs many of the non-core computing services found in most computers. Non-blade computers require components that are bulky, hot and space-inefficient, and duplicated across many computers that may or may not be performing at capacity. By locating these services in one place and sharing them between the blade computers, the overall utilization is more efficient. The specifics of which services are provided and how vary by vendor.&lt;br /&gt;
&lt;br /&gt;
'''Power'''&lt;br /&gt;
&lt;br /&gt;
Computers operate over a range of DC voltages, yet power is delivered from utilities as AC, and at higher voltages than required within the computer. Converting this current requires power supply units (or PSUs). To ensure that the failure of one power source does not affect the operation of the computer, even entry-level servers have redundant power supplies, again adding to the bulk and heat output of the design.&lt;br /&gt;
&lt;br /&gt;
The blade enclosure's power supply provides a single power source for all blades within the enclosure. This single power source may be in the form of a power supply in the enclosure or a dedicated separate PSU supplying DC to multiple enclosures [1]. This setup not only reduces the number of PSUs required to provide a resilient power supply, but it also improves efficiency because it reduces the number of idle PSUs. In the event of a PSU failure the blade chassis throttles down individual blade server performance until it matches the available power. This is carried out in steps of 12.5% per CPU until power balance is achieved.&lt;br /&gt;
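The stepwise throttling described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not vendor firmware logic: the function name, the wattage figures and the assumption that power draw scales linearly with the performance level are all hypothetical.&lt;br /&gt;

```python
STEP_EIGHTHS = 1  # one throttling step = 1/8 = 12.5% of full performance


def throttle_level(full_power_watts, blade_count, available_watts):
    """Return the performance level in eighths (8 = 100%) at which the
    total demand of all blades no longer exceeds the available power."""
    level = 8
    # Assumption: each blade's power draw scales linearly with its performance level.
    while level > 0 and full_power_watts * blade_count * level / 8 > available_watts:
        level -= STEP_EIGHTHS
    return level


if __name__ == "__main__":
    # Hypothetical enclosure: 10 blades of 400 W each, 3000 W left after a PSU failure.
    level = throttle_level(400, 10, 3000)
    print(level * 12.5, "percent of full performance")
```

In this hypothetical case the enclosure steps down twice, from 100% to 75%, at which point the demand of 3000 W matches the available power.&lt;br /&gt;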
&lt;br /&gt;
'''Cooling'''&lt;br /&gt;
&lt;br /&gt;
During operation, electrical and mechanical components produce heat, which must be displaced to ensure the proper functioning of the components. In blade enclosures, as in most computing systems, heat is removed with fans.&lt;br /&gt;
&lt;br /&gt;
A frequently underestimated problem when designing high-performance computer systems is the conflict between the amount of heat a system generates and the ability of its fans to remove the heat. The blade's shared power and cooling means that it does not generate as much heat as traditional servers. Newer blade enclosure designs feature high-speed, adjustable fans and control logic that tune the cooling to the system's requirements.[2]&lt;br /&gt;
&lt;br /&gt;
At the same time, the increased density of blade server configurations can still result in higher overall demands for cooling when a rack is populated at over 50%. This is especially true with early generation blades. In absolute terms, a fully populated rack of blade servers is likely to require more cooling capacity than a fully populated rack of standard 1U servers.&lt;br /&gt;
&lt;br /&gt;
'''Networking'''&lt;br /&gt;
&lt;br /&gt;
Computers are increasingly being produced with high-speed, integrated network interfaces, and most are expandable to allow for the addition of connections that are faster, more resilient and run over different media (copper and fiber). These may require extra engineering effort in the design and manufacture of the blade, consume space in both the installation and capacity for installation (empty expansion slots) and hence more complexity. High-speed network topologies require expensive, high-speed integrated circuits and media, while most computers do not utilise all the bandwidth available.&lt;br /&gt;
&lt;br /&gt;
The blade enclosure provides one or more network buses to which the blade will connect, and either presents these ports individually in a single location (versus one in each computer chassis), or aggregates them into fewer ports, reducing the cost of connecting the individual devices. These may be presented in the chassis itself, or in networking blades[3].&lt;br /&gt;
&lt;br /&gt;
'''Storage'''&lt;br /&gt;
&lt;br /&gt;
While computers typically need hard-disks to store the operating system, application and data for the computer, these are not necessarily required locally. Many storage connection methods (e.g. FireWire, SATA, SCSI, DAS, Fibre Channel and iSCSI) are readily moved outside the server, though not all are used in enterprise-level installations. Implementing these connection interfaces within the computer presents similar challenges to the networking interfaces (indeed iSCSI runs over the network interface), and similarly these can be removed from the blade and presented individually or aggregated either on the chassis or through other blades.&lt;br /&gt;
&lt;br /&gt;
The ability to boot the blade from a storage area network (SAN) allows for an entirely disk-free blade. This may have higher processor density or better reliability than systems having individual disks on each blade.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Advantages of Blade Servers ==&lt;br /&gt;
&lt;br /&gt;
     '''Reduced Space Requirements''' - Greater density provides up to 35 to 45 percent improvement compared to tower or rackmounted &amp;lt;br&amp;gt;servers.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
     '''Reduced Power Consumption and Improved Power Management''' - consolidating power supplies into the blade chassis reduces the number&amp;lt;br&amp;gt; of separate power supplies needed and reduces the power requirements per server.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
     '''Lower Management Cost''' - server consolidation and resource centralization simplifies server deployment, management and &amp;lt;br&amp;gt;administration and improves management and control.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    ''' Simplified Cabling''' - rack mount servers, while helping consolidate servers into a centralized location, create wiring &amp;lt;br&amp;gt;proliferation. Blade servers simplify cabling requirements and reduce wiring by up to 70 percent. Power cabling, operator wiring &amp;lt;br&amp;gt;(keyboard, mouse, etc.) and communications cabling (Ethernet, SAN connections, cluster connection) are greatly reduced.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
     '''Future Proofing Through Modularity''' - as new processor, communications, storage and interconnect technology becomes available, it &amp;lt;br&amp;gt;can be implemented in blades that install into existing equipment, upgrading server operation at a minimum cost and with no &amp;lt;br&amp;gt;disruption of basic server functionality.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    ''' Easier Physical Deployment''' - once a blade server chassis has been installed, adding additional servers is merely a matter of &amp;lt;br&amp;gt;sliding in additional blades into the chassis. Software management tools simplify the management and reporting functions for blade &amp;lt;br&amp;gt;servers. Redundant power modules and consolidated communication bays simplify integration into datacenters and increase &amp;lt;br&amp;gt;reliability.&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
== Are blade servers an extension of message passing? ==&lt;br /&gt;
&lt;br /&gt;
Blade servers use message passing in order to achieve fast and efficient performance. Parallel computing frequently relies upon message passing to exchange information between computational units. In high-performance computing, the most common message passing technology is the '''Message Passing Interface (MPI)''', which is being developed in an open-source implementation supported by Cisco Systems® and other vendors.&lt;br /&gt;
&lt;br /&gt;
High performance computing (HPC) cluster applications require a high-performance interconnect between blade servers to run computation-intensive workloads quickly and efficiently. When messages are passed between nodes, some time is spent transmitting them, and depending on how often processes synchronize data, that time can have a significant effect on total application run time. It is critically important to understand the application's interprocess communication patterns and the frequency of updates, because these affect the performance and design of the parallel application, the design of the interconnecting network, and the choice of network technology.&lt;br /&gt;
&lt;br /&gt;
With traditional transport protocols such as TCP/IP, the CPU is responsible both for managing how data is moved between I/O and memory and for transport protocol processing. As a result, time spent communicating between nodes is time not spent processing the application. Therefore, minimizing communications time is a key consideration for certain classes of applications.&lt;br /&gt;
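&lt;br /&gt;
The effect of communication time on total run time can be illustrated with a rough estimate. The sketch below assumes a simple model that is not taken from the sources cited here: each message costs a fixed latency plus its size divided by the link bandwidth, and communication does not overlap with computation. The function name and all numbers are hypothetical.&lt;br /&gt;

```python
def run_time(compute_s, n_messages, latency_s, message_bytes, bandwidth_bytes_per_s):
    """Estimate total run time as computation plus communication.

    Assumed model: every message costs a fixed latency plus its
    transmission time, and nothing overlaps with computation.
    """
    per_message_s = latency_s + message_bytes / bandwidth_bytes_per_s
    return compute_s + n_messages * per_message_s


# Hypothetical workload: 10 s of computation, 100,000 messages of 1 KB each.
slow = run_time(10.0, 100_000, 50e-6, 1024, 100e6)  # higher-latency interconnect
fast = run_time(10.0, 100_000, 5e-6, 1024, 1e9)     # lower-latency interconnect
print(f"slow interconnect: {slow:.2f} s, fast interconnect: {fast:.2f} s")
```

With these assumed figures the computation is identical in both cases; only the per-message cost changes the total.&lt;br /&gt;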
&lt;br /&gt;
MPI is “middleware” software that sits between the application and the network hardware. It provides a portable mechanism for exchanging messages between processes regardless of the underlying network or parallel computational environment. As such, implementations of the MPI standard use underlying communications stacks such as TCP or UDP over IP, InfiniBand, or Myrinet to communicate between processes. MPI offers a rich set of functions that can be combined in simple or complex ways to express a wide range of parallel computations. The ability to exchange messages enables instructions or data to be passed between nodes to distribute data sets for calculation. MPI has been implemented on a wide variety of platforms, operating systems, and cluster and supercomputer architectures.&lt;br /&gt;
&lt;br /&gt;
See Also [http://h41112.www4.hp.com/promo/blades-community/eur/en/library/articles/Both_worldspdf.pdf] '''The best of both worlds'''&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=3.References&amp;diff=3124</id>
		<title>3.References</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=3.References&amp;diff=3124"/>
		<updated>2007-09-06T03:47:00Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://en.wikipedia.org/wiki/Blade_server] Wikipedia&lt;br /&gt;
&lt;br /&gt;
[http://www.compactpci-systems.com/columns/software_corner/pdfs/3.03.pdf] www.compactpci-systems.com&lt;br /&gt;
&lt;br /&gt;
[http://www.bladeserverscenter.com/i_technology.shtml] www.bladeserverscenter.com&lt;br /&gt;
&lt;br /&gt;
[http://www.blade.org/techover.cfm] www.blade.org&lt;br /&gt;
&lt;br /&gt;
[http://h41112.www4.hp.com/promo/blades-community/eur/en/library/articles/Both_worldspdf.pdf] www.hp.com&lt;br /&gt;
&lt;br /&gt;
[http://www.terian.com/terianprods.asp?s=Blades] www.terian.com&lt;br /&gt;
&lt;br /&gt;
[http://www.cisco.com/application/pdf/en/us/guest/netsol/ns500/c654/cdccont_0900aecd804ab4ce.pdf] www.cisco.com&lt;br /&gt;
&lt;br /&gt;
[http://www.hpcx.ac.uk/support/training/MPP.html] www.hpcx.ac.uk&lt;br /&gt;
&lt;br /&gt;
[http://docs.hp.com/en/B6060-96018/ch01s01.html] www.docs.hp.com&lt;br /&gt;
&lt;br /&gt;
David E. Culler, Jaswinder Pal Singh, with Anoop Gupta,&lt;br /&gt;
Parallel Computer Architecture: A Hardware/Software Approach, © 1999 Morgan Kaufmann&lt;br /&gt;
&lt;br /&gt;
Modular Systems: The Evolution of Reliability &lt;br /&gt;
White Paper #76 by Neil Rasmussen and Suzanne Niles&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=3.References&amp;diff=3123</id>
		<title>3.References</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=3.References&amp;diff=3123"/>
		<updated>2007-09-06T03:46:40Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://en.wikipedia.org/wiki/Blade_server] Wikipedia&lt;br /&gt;
&lt;br /&gt;
[http://www.compactpci-systems.com/columns/software_corner/pdfs/3.03.pdf] www.compactpci-systems.com&lt;br /&gt;
&lt;br /&gt;
[http://www.bladeserverscenter.com/i_technology.shtml] www.bladeserverscenter.com&lt;br /&gt;
&lt;br /&gt;
[http://www.blade.org/techover.cfm] www.blade.org&lt;br /&gt;
&lt;br /&gt;
[http://h41112.www4.hp.com/promo/blades-community/eur/en/library/articles/Both_worldspdf.pdf] www.hp.com&lt;br /&gt;
&lt;br /&gt;
[http://www.terian.com/terianprods.asp?s=Blades] www.terian.com&lt;br /&gt;
&lt;br /&gt;
[http://www.cisco.com/application/pdf/en/us/guest/netsol/ns500/c654/cdccont_0900aecd804ab4ce.pdf] www.cisco.com&lt;br /&gt;
&lt;br /&gt;
[http://www.hpcx.ac.uk/support/training/MPP.html] www.hpcx.ac.uk&lt;br /&gt;
&lt;br /&gt;
[http://docs.hp.com/en/B6060-96018/ch01s01.html] http://docs.hp.com&lt;br /&gt;
&lt;br /&gt;
David E. Culler, Jaswinder Pal Singh, with Anoop Gupta,&lt;br /&gt;
Parallel Computer Architecture: A Hardware/Software Approach, © 1999 Morgan Kaufmann&lt;br /&gt;
&lt;br /&gt;
Modular Systems: The Evolution of Reliability &lt;br /&gt;
White Paper #76 by Neil Rasmussen and Suzanne Niles&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=1.Message_passing&amp;diff=3122</id>
		<title>1.Message passing</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=1.Message_passing&amp;diff=3122"/>
		<updated>2007-09-06T03:45:53Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Parallel programming requires interaction between the processes that run simultaneously on the individual processors, and this interaction is enabled by passing messages between the processors. This important class of parallel machines, called message-passing architectures, employs complete computers as building blocks, including the microprocessor, memory, and I/O system, and provides communication between processors as explicit I/O operations. This style of architecture has much in common with networks of workstations, or clusters, except that the packaging of nodes is typically much tighter and the network is of much higher capability than a standard local area network.&lt;br /&gt;
&lt;br /&gt;
The world's largest supercomputers are used almost exclusively to run applications that are parallelised using message passing. This programming model is directly applicable to almost every parallel computer architecture.&lt;br /&gt;
&lt;br /&gt;
Parallel programming by definition involves co-operation between processes to solve a common task. The programmer has to define the tasks that will be executed by the processors, and also how these tasks are to synchronise and exchange data with one another. In the message-passing model the tasks are separate processes that communicate and synchronise by explicitly sending each other messages. All these parallel operations are performed via calls to some message-passing interface that is entirely responsible for interfacing with the physical communication network linking the actual processors together.&lt;br /&gt;
&lt;br /&gt;
[[Image:Message Passing2.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Message Passing ==&lt;br /&gt;
&lt;br /&gt;
In message passing, a substantial distance exists between the programming model and the actual hardware primitives: user communication is performed through operating system or library calls that carry out the low-level actions, including the actual communication operation. The most common user-level communication operations in message passing are variants of send and receive. In its simplest form, send specifies a local data buffer that is to be transmitted and a receiving process (typically on a remote processor). Receive specifies a sending process and a local data buffer into which the transmitted data is to be placed. Together, a matching send and receive cause a data transfer from one processor to another. In most message passing systems, the send operation also allows an identifier or tag to be attached to the message, and the receive operation specifies a matching rule (such as a specific tag from a specific processor).&lt;br /&gt;
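&lt;br /&gt;
The send/receive matching described above can be sketched in Python. This is a toy model rather than any real message-passing library: threads stand in for processes, and the Mailbox class, the send function, and the tag values are all hypothetical names chosen for illustration.&lt;br /&gt;

```python
import queue
import threading


class Mailbox:
    """Per-process inbox that matches incoming messages on (source, tag)."""

    def __init__(self):
        self._arrivals = queue.Queue()
        self._unmatched = []  # messages that arrived before anyone asked for them

    def deliver(self, source, tag, data):
        self._arrivals.put((source, tag, data))

    def receive(self, source, tag):
        # First look through messages that arrived earlier but did not match.
        for i, (src, msg_tag, data) in enumerate(self._unmatched):
            if src == source and msg_tag == tag:
                del self._unmatched[i]
                return data
        # Otherwise block until a message satisfying the matching rule arrives.
        while True:
            src, msg_tag, data = self._arrivals.get()
            if src == source and msg_tag == tag:
                return data
            self._unmatched.append((src, msg_tag, data))


def send(mailboxes, source, dest, tag, data):
    """Copy the sender's local buffer and hand it to the destination mailbox."""
    mailboxes[dest].deliver(source, tag, list(data))


mailboxes = {0: Mailbox(), 1: Mailbox()}
received = {}


def process_0():
    send(mailboxes, source=0, dest=1, tag=7, data=[1, 2, 3])
    send(mailboxes, source=0, dest=1, tag=9, data=[4, 5, 6])


def process_1():
    # Receives are posted in the opposite order; the tags pick the right message.
    received["tag9"] = mailboxes[1].receive(source=0, tag=9)
    received["tag7"] = mailboxes[1].receive(source=0, tag=7)


workers = [threading.Thread(target=process_0), threading.Thread(target=process_1)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(received)
```

The send in this sketch returns as soon as the data has been copied, which corresponds to only one of the possible completion semantics for a send.&lt;br /&gt;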
&lt;br /&gt;
[[Image:Message Passing.jpg]]&lt;br /&gt;
&lt;br /&gt;
The combination of a send and a matching receive accomplishes a memory-to-memory copy, where each end specifies its local data address, and a pairwise synchronization event. There are several possible variants of this synchronization event, depending upon whether the send completes when the receive has been executed, when the send buffer is available for reuse, or when the request has been accepted. Similarly, the receive can potentially wait until a matching send occurs or simply post the receive. Each of these variants has somewhat different semantics and different implementation requirements. Message passing has long been used as a means of communication and synchronization among arbitrary collections of cooperating sequential processes, even on a single processor. Important examples include programming languages, such as CSP and Occam, and common operating system functions, such as sockets. Parallel programs using message passing are typically quite structured, like their shared-memory counterparts. Most often, all nodes execute identical copies of a program, with the same code and private variables. Usually, processes can name each other using a simple linear ordering of the processes comprising a program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Typical Structure ==&lt;br /&gt;
&lt;br /&gt;
Early message passing machines provided hardware primitives that were very close to the simple send/receive user-level communication abstraction, with some additional restrictions. A node was connected to a fixed set of neighbors in a regular pattern by point-to-point links that behaved as simple FIFOs. Most early machines were hypercubes, where each node is connected to n other nodes whose binary addresses differ from its own by one bit, for a total of 2^n nodes, or meshes, where the nodes are connected to neighbors in two or three dimensions. The network topology was especially important in the early message passing machines, because only the neighboring processors could be named in a send or receive operation. The data transfer involved the sender writing into a link and the receiver reading from it; the sender could not finish writing the message until the receiver started reading it, so the send would block until the receive occurred. In modern terms this is called synchronous message passing because the two events coincide in time. The details of moving data were hidden from the programmer in a message passing library, forming a layer of software between send and receive calls and the actual hardware.&lt;br /&gt;
&lt;br /&gt;
[[Image:Hypercube.jpg]] &amp;lt;br&amp;gt;&lt;br /&gt;
Typical structure of an early message passing machine&lt;br /&gt;
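&lt;br /&gt;
The hypercube addressing rule can be made concrete with a short Python sketch. It is only an illustration of the "differ by one bit" rule and is not code from any particular machine; the function name hypercube_neighbors is hypothetical.&lt;br /&gt;

```python
def hypercube_neighbors(node, n):
    """Return the n neighbors of `node` in an n-dimensional hypercube.

    Nodes are numbered 0 .. 2**n - 1. Flipping one bit of the binary
    address (XOR with a power of two) yields one neighbor per dimension.
    """
    if node not in range(2 ** n):
        raise ValueError("node address out of range for this hypercube")
    return [node ^ (2 ** bit) for bit in range(n)]


# A 3-dimensional hypercube has 2**3 = 8 nodes, each with 3 neighbors.
n = 3
for node in range(2 ** n):
    neighbors = hypercube_neighbors(node, n)
    print(format(node, "03b"), "is linked to", [format(nb, "03b") for nb in neighbors])
```

On such a machine, a send or receive could name only the addresses this function returns for the local node.&lt;br /&gt;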
&lt;br /&gt;
The direct FIFO design was soon replaced by more versatile and more robust designs that provided direct memory access (DMA) transfers on either end of the communication event. The use of DMA allowed non-blocking sends, where the sender is able to initiate a send and continue with useful computation (or even perform a receive) while the send completes. On the receiving end, the transfer is accepted via a DMA transfer by the message layer into a buffer and queued until the target process performs a matching receive, at which point the data is copied into the address space of the receiving process. The physical topology of the communication network dominated the programming model of these early machines, and parallel algorithms were often stated in terms of a specific interconnection topology, e.g., a ring, a grid, or a hypercube. However, to make the machines more generally useful, the designers of the message layers provided support for communication between arbitrary processors, rather than only between physical neighbors. This was originally supported by forwarding the data within the message layer along links in the network. Soon this routing function was moved into the hardware, so each node consisted of a processor with memory and a switch, called a router, that could forward messages. However, in this store-and-forward approach the time to transfer a message is proportional to the number of hops it takes through the network, so there remained an emphasis on interconnection topology.&lt;br /&gt;
&lt;br /&gt;
The emphasis on network topology was significantly reduced with the introduction of more general-purpose networks, which pipelined the message transfer through each of the routers forming the interconnection network. In most modern message passing machines, the incremental delay introduced by each router is small enough that the transfer time is dominated by the time to simply move the data between the processor and the network, not by how far it travels. This greatly simplifies the programming model; typically the processors are viewed as simply forming a linear sequence with uniform communication costs. A processor in a message passing machine can name only the locations in its local memory, and it can name each of the processors, perhaps by number or by route. A user process can only name private addresses and other processes; it can transfer data using the send/receive calls.&lt;br /&gt;
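&lt;br /&gt;
The difference between store-and-forward and pipelined routing can be compared with a rough estimate. The sketch below assumes a simplified model: in store-and-forward routing every hop retransmits the whole message, while in pipelined routing the message is transmitted once and each router adds only a small fixed delay. The function names, message size, bandwidth, and router delay are hypothetical values chosen for illustration.&lt;br /&gt;

```python
def store_and_forward_time(message_bytes, bandwidth, hops):
    """Each router receives the whole message before forwarding it."""
    return hops * (message_bytes / bandwidth)


def pipelined_time(message_bytes, bandwidth, hops, router_delay):
    """The message streams through the routers; each hop adds a small delay."""
    return message_bytes / bandwidth + hops * router_delay


# Hypothetical figures: 4 KB message, 100 MB/s links, 50 ns delay per router.
message_bytes = 4096
bandwidth = 100e6
router_delay = 50e-9

for hops in (1, 4, 16):
    saf = store_and_forward_time(message_bytes, bandwidth, hops)
    pipe = pipelined_time(message_bytes, bandwidth, hops, router_delay)
    print(f"{hops:2d} hops: store-and-forward {saf * 1e6:7.2f} us, pipelined {pipe * 1e6:6.2f} us")
```

Under these assumptions the store-and-forward time grows in proportion to the hop count, while the pipelined time stays close to the cost of a single transmission, which is why topology matters less to the programmer on pipelined networks.&lt;br /&gt;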
&lt;br /&gt;
&lt;br /&gt;
== Advantages ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Portability''' - Message passing is implemented on most parallel platforms.&lt;br /&gt;
&lt;br /&gt;
'''Universality''' - The model makes minimal assumptions about the underlying parallel hardware. Message-passing libraries exist on computers linked by networks and on shared- and distributed-memory multiprocessors.&lt;br /&gt;
&lt;br /&gt;
'''Simplicity''' - The model supports explicit control of memory references, which makes debugging easier.&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=1.Message_passing&amp;diff=3121</id>
		<title>1.Message passing</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=1.Message_passing&amp;diff=3121"/>
		<updated>2007-09-06T03:45:37Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Parallel programming requires interaction between the processes that run simultaneously on the individual processors, and this interaction is enabled by passing messages between the processors. This important class of parallel machines, called message-passing architectures, employs complete computers as building blocks, including the microprocessor, memory, and I/O system, and provides communication between processors as explicit I/O operations. This style of architecture has much in common with networks of workstations, or clusters, except that the packaging of nodes is typically much tighter and the network is of much higher capability than a standard local area network.&lt;br /&gt;
&lt;br /&gt;
The world's largest supercomputers are used almost exclusively to run applications that are parallelised using message passing. This programming model is directly applicable to almost every parallel computer architecture.&lt;br /&gt;
&lt;br /&gt;
Parallel programming by definition involves co-operation between processes to solve a common task. The programmer has to define the tasks that will be executed by the processors, and also how these tasks are to synchronise and exchange data with one another. In the message-passing model the tasks are separate processes that communicate and synchronise by explicitly sending each other messages. All these parallel operations are performed via calls to some message-passing interface that is entirely responsible for interfacing with the physical communication network linking the actual processors together.&lt;br /&gt;
&lt;br /&gt;
[[Image:Message Passing2.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Message Passing ==&lt;br /&gt;
&lt;br /&gt;
In message passing, a substantial distance exists between the programming model and the actual hardware primitives: user communication is performed through operating system or library calls that carry out the low-level actions, including the actual communication operation. The most common user-level communication operations in message passing are variants of send and receive. In its simplest form, send specifies a local data buffer that is to be transmitted and a receiving process (typically on a remote processor). Receive specifies a sending process and a local data buffer into which the transmitted data is to be placed. Together, a matching send and receive cause a data transfer from one processor to another. In most message passing systems, the send operation also allows an identifier or tag to be attached to the message, and the receive operation specifies a matching rule (such as a specific tag from a specific processor).&lt;br /&gt;
&lt;br /&gt;
[[Image:Message Passing.jpg]]&lt;br /&gt;
&lt;br /&gt;
The combination of a send and a matching receive accomplishes a memory-to-memory copy, where each end specifies its local data address, and a pairwise synchronization event. There are several possible variants of this synchronization event, depending upon whether the send completes when the receive has been executed, when the send buffer is available for reuse, or when the request has been accepted. Similarly, the receive can potentially wait until a matching send occurs or simply post the receive. Each of these variants has somewhat different semantics and different implementation requirements. Message passing has long been used as a means of communication and synchronization among arbitrary collections of cooperating sequential processes, even on a single processor. Important examples include programming languages, such as CSP and Occam, and common operating system functions, such as sockets. Parallel programs using message passing are typically quite structured, like their shared-memory counterparts. Most often, all nodes execute identical copies of a program, with the same code and private variables. Usually, processes can name each other using a simple linear ordering of the processes comprising a program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Typical Structure ==&lt;br /&gt;
&lt;br /&gt;
Early message passing machines provided hardware primitives that were very close to the simple send/receive user-level communication abstraction, with some additional restrictions. A node was connected to a fixed set of neighbors in a regular pattern by point-to-point links that behaved as simple FIFOs. Most early machines were hypercubes, where each node is connected to n other nodes whose binary addresses differ from its own by one bit, for a total of 2^n nodes, or meshes, where the nodes are connected to neighbors in two or three dimensions. The network topology was especially important in the early message passing machines, because only the neighboring processors could be named in a send or receive operation. The data transfer involved the sender writing into a link and the receiver reading from it; the sender could not finish writing the message until the receiver started reading it, so the send would block until the receive occurred. In modern terms this is called synchronous message passing because the two events coincide in time. The details of moving data were hidden from the programmer in a message passing library, forming a layer of software between send and receive calls and the actual hardware.&lt;br /&gt;
&lt;br /&gt;
[[Image:Hypercube.jpg]] &amp;lt;br&amp;gt;&lt;br /&gt;
Typical structure of an early message passing machine&lt;br /&gt;
&lt;br /&gt;
The direct FIFO design was soon replaced by more versatile and more robust designs that provided direct memory access (DMA) transfers on either end of the communication event. The use of DMA allowed non-blocking sends, where the sender is able to initiate a send and continue with useful computation (or even perform a receive) while the send completes. On the receiving end, the transfer is accepted via a DMA transfer by the message layer into a buffer and queued until the target process performs a matching receive, at which point the data is copied into the address space of the receiving process. The physical topology of the communication network dominated the programming model of these early machines, and parallel algorithms were often stated in terms of a specific interconnection topology, e.g., a ring, a grid, or a hypercube. However, to make the machines more generally useful, the designers of the message layers provided support for communication between arbitrary processors, rather than only between physical neighbors. This was originally supported by forwarding the data within the message layer along links in the network. Soon this routing function was moved into the hardware, so each node consisted of a processor with memory and a switch, called a router, that could forward messages. However, in this store-and-forward approach the time to transfer a message is proportional to the number of hops it takes through the network, so there remained an emphasis on interconnection topology.&lt;br /&gt;
&lt;br /&gt;
The emphasis on network topology was significantly reduced with the introduction of more general-purpose networks, which pipelined the message transfer through each of the routers forming the interconnection network. In most modern message passing machines, the incremental delay introduced by each router is small enough that the transfer time is dominated by the time to simply move the data between the processor and the network, not by how far it travels. This greatly simplifies the programming model; typically the processors are viewed as simply forming a linear sequence with uniform communication costs. A processor in a message passing machine can name only the locations in its local memory, and it can name each of the processors, perhaps by number or by route. A user process can only name private addresses and other processes; it can transfer data using the send/receive calls.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Advantages ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Portability''' - Message passing is implemented on most parallel platforms.&lt;br /&gt;
&lt;br /&gt;
'''Universality''' - The model makes minimal assumptions about the underlying parallel hardware. Message-passing libraries exist on computers linked by networks and on shared- and distributed-memory multiprocessors.&lt;br /&gt;
&lt;br /&gt;
'''Simplicity''' - The model supports explicit control of memory references, which makes debugging easier.&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=1.Message_passing&amp;diff=3120</id>
		<title>1.Message passing</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=1.Message_passing&amp;diff=3120"/>
		<updated>2007-09-06T03:45:11Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Parallel programming requires interaction between the processes that run simultaneously on the individual processors, and this interaction is enabled by passing messages between the processors. This important class of parallel machines, called message-passing architectures, employs complete computers as building blocks, including the microprocessor, memory, and I/O system, and provides communication between processors as explicit I/O operations. This style of architecture has much in common with networks of workstations, or clusters, except that the packaging of nodes is typically much tighter and the network is of much higher capability than a standard local area network.&lt;br /&gt;
&lt;br /&gt;
The world's largest supercomputers are used almost exclusively to run applications that are parallelised using message passing. This programming model is directly applicable to almost every parallel computer architecture.&lt;br /&gt;
&lt;br /&gt;
Parallel programming by definition involves co-operation between processes to solve a common task. The programmer has to define the tasks that will be executed by the processors, and also how these tasks are to synchronise and exchange data with one another. In the message-passing model the tasks are separate processes that communicate and synchronise by explicitly sending each other messages. All these parallel operations are performed via calls to some message-passing interface that is entirely responsible for interfacing with the physical communication network linking the actual processors together.&lt;br /&gt;
&lt;br /&gt;
[[Image:Message Passing2.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Message Passing ==&lt;br /&gt;
&lt;br /&gt;
In message passing, a substantial distance exists between the programming model and the actual hardware primitives: user communication is performed through operating system or library calls that carry out the low-level actions, including the actual communication operation. The most common user-level communication operations in message passing are variants of send and receive. In its simplest form, send specifies a local data buffer that is to be transmitted and a receiving process (typically on a remote processor). Receive specifies a sending process and a local data buffer into which the transmitted data is to be placed. Together, a matching send and receive cause a data transfer from one processor to another. In most message passing systems, the send operation also allows an identifier or tag to be attached to the message, and the receive operation specifies a matching rule (such as a specific tag from a specific processor).&lt;br /&gt;
&lt;br /&gt;
[[Image:Message Passing.jpg]]&lt;br /&gt;
&lt;br /&gt;
The combination of a send and a matching receive accomplishes a memory-to-memory copy, where each end specifies its local data address, and a pairwise synchronization event. There are several possible variants of this synchronization event, depending upon whether the send completes when the receive has been executed, when the send buffer is available for reuse, or when the request has been accepted. Similarly, the receive can potentially wait until a matching send occurs or simply post the receive. Each of these variants has somewhat different semantics and different implementation requirements. Message passing has long been used as a means of communication and synchronization among arbitrary collections of cooperating sequential processes, even on a single processor. Important examples include programming languages, such as CSP and Occam, and common operating system functions, such as sockets. Parallel programs using message passing are typically quite structured, like their shared-memory counterparts. Most often, all nodes execute identical copies of a program, with the same code and private variables. Usually, processes can name each other using a simple linear ordering of the processes comprising a program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Typical Structure ==&lt;br /&gt;
&lt;br /&gt;
Early message passing machines provided hardware primitives that were very close to the simple send/receive user-level communication abstraction, with some additional restrictions. A node was connected to a fixed set of neighbors in a regular pattern by point-to-point links that behaved as simple FIFOs. Most early machines were hypercubes, where each node is connected to n other nodes whose binary addresses differ from its own by one bit, for a total of 2^n nodes, or meshes, where the nodes are connected to neighbors in two or three dimensions. The network topology was especially important in the early message passing machines, because only the neighboring processors could be named in a send or receive operation. The data transfer involved the sender writing into a link and the receiver reading from it; the sender could not finish writing the message until the receiver started reading it, so the send would block until the receive occurred. In modern terms this is called synchronous message passing because the two events coincide in time. The details of moving data were hidden from the programmer in a message passing library, forming a layer of software between send and receive calls and the actual hardware.&lt;br /&gt;
&lt;br /&gt;
[[Image:Hypercube.jpg]] &amp;lt;br&amp;gt;&lt;br /&gt;
Typical structure of an early message passing machine&lt;br /&gt;
&lt;br /&gt;
The direct FIFO design was soon replaced by more versatile and more robust designs that provided direct memory access (DMA) transfers on either end of the communication event. The use of DMA allowed non-blocking sends, where the sender is able to initiate a send and continue with useful computation (or even perform a receive) while the send completes. On the receiving end, the transfer is accepted via a DMA transfer by the message layer into a buffer and queued until the target process performs a matching receive, at which point the data is copied into the address space of the receiving process. The physical topology of the communication network dominated the programming model of these early machines, and parallel algorithms were often stated in terms of a specific interconnection topology, e.g., a ring, a grid, or a hypercube. However, to make the machines more generally useful, the designers of the message layers provided support for communication between arbitrary processors, rather than only between physical neighbors. This was originally supported by forwarding the data within the message layer along links in the network. Soon this routing function was moved into the hardware, so each node consisted of a processor with memory and a switch, called a router, that could forward messages. However, in this store-and-forward approach the time to transfer a message is proportional to the number of hops it takes through the network, so there remained an emphasis on interconnection topology.&lt;br /&gt;
&lt;br /&gt;
The emphasis on network topology was significantly reduced with the introduction of more general-purpose networks, which pipelined the message transfer through each of the routers forming the interconnection network. In most modern message passing machines, the incremental delay introduced by each router is small enough that the transfer time is dominated by the time to simply move the data between the processor and the network, not by how far it travels. This greatly simplifies the programming model; typically the processors are viewed as simply forming a linear sequence with uniform communication costs. A processor in a message passing machine can name only the locations in its local memory, and it can name each of the processors, perhaps by number or by route. A user process can only name private addresses and other processes; it can transfer data using the send/receive calls.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Advantages ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Portability''': Message passing is implemented on most parallel platforms.&lt;br /&gt;
&lt;br /&gt;
'''Universality''': The model makes minimal assumptions about the underlying parallel hardware. Message-passing libraries exist for computers linked by networks and for shared- and distributed-memory multiprocessors.&lt;br /&gt;
&lt;br /&gt;
'''Simplicity''': The model supports explicit control of memory references, which makes debugging easier.&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=1.Message_passing&amp;diff=3119</id>
		<title>1.Message passing</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=1.Message_passing&amp;diff=3119"/>
		<updated>2007-09-06T03:44:37Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Parallel programming requires interaction between the processes that run simultaneously on the individual processors, and this is enabled by passing messages between the processors. This important class of parallel machines, called message-passing architectures, employs complete computers as building blocks, including the microprocessor, memory, and I/O system, and provides communication between processors as explicit I/O operations. This style of architecture has much in common with networks of workstations, or clusters, except that the packaging of nodes is typically much tighter and the network is of much higher capability than a standard local area network.&lt;br /&gt;
&lt;br /&gt;
The world's largest supercomputers are used almost exclusively to run applications that are parallelised using message passing. The programming model is directly applicable to almost every parallel computer architecture.&lt;br /&gt;
&lt;br /&gt;
Parallel programming by definition involves co-operation between processes to solve a common task. The programmer has to define the tasks that will be executed by the processors, and also how these tasks are to synchronise and exchange data with one another. In the message-passing model the tasks are separate processes that communicate and synchronise by explicitly sending each other messages. All these parallel operations are performed via calls to some message-passing interface that is entirely responsible for interfacing with the physical communication network linking the actual processors together.&lt;br /&gt;
&lt;br /&gt;
[[Image:Message Passing2.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Message Passing ==&lt;br /&gt;
&lt;br /&gt;
In message passing, a substantial distance exists between the programming model and the actual hardware primitives, with user communication performed through operating system or library calls that perform the low-level actions, including the actual communication operation. The most common user-level communication operations in message passing are variants of send and receive. In its simplest form, send specifies a local data buffer that is to be transmitted and a receiving process (typically on a remote processor). Receive specifies a sending process and a local data buffer into which the transmitted data is to be placed. Together, a matching send and receive cause a data transfer from one processor to another. In most message passing systems, the send operation also allows an identifier or tag to be attached to the message, and the receive operation specifies a matching rule (such as a specific tag from a specific processor).&lt;br /&gt;
&lt;br /&gt;
[[Image:Message Passing.jpg]]&lt;br /&gt;
&lt;br /&gt;
The combination of a send and a matching receive accomplishes a memory-to-memory copy, where each end specifies its local data address, and a pairwise synchronization event. There are several possible variants of this synchronization event, depending upon whether the send completes when the receive has been executed, when the send buffer is available for reuse, or when the request has been accepted. Similarly, the receive can potentially wait until a matching send occurs or simply post the receive. Each of these variants has somewhat different semantics and different implementation requirements. Message passing has long been used as a means of communication and synchronization among arbitrary collections of cooperating sequential processes, even on a single processor. Important examples include programming languages, such as CSP and Occam, and common operating system functions, such as sockets. Parallel programs using message passing are typically quite structured, like their shared-memory counterparts. Most often, all nodes execute identical copies of a program, with the same code and private variables. Usually, processes can name each other using a simple linear ordering of the processes comprising a program.&lt;br /&gt;
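&lt;br /&gt;
The tagged send and receive described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Mailbox class and its deliver and receive methods are hypothetical names, not part of any real message-passing library, and threads stand in for separate processes. The sketch models the variant in which the send completes as soon as the request has been accepted, while the receive blocks until a message matching the requested source and tag arrives.&lt;br /&gt;

```python
import queue
import threading


class Mailbox:
    """Hypothetical per-process mailbox; not taken from any real library."""

    def __init__(self):
        self._incoming = queue.Queue()
        self._unmatched = []  # messages that arrived but were not yet requested

    def deliver(self, source, tag, data):
        # Called on behalf of the sender: enqueue a copy of the send buffer,
        # so the sender may reuse its buffer immediately.
        self._incoming.put((source, tag, list(data)))

    def receive(self, source, tag):
        # Blocking receive: return the first message matching (source, tag).
        for i, (src, t, data) in enumerate(self._unmatched):
            if (src, t) == (source, tag):
                del self._unmatched[i]
                return data
        while True:
            src, t, data = self._incoming.get()
            if (src, t) == (source, tag):
                return data
            self._unmatched.append((src, t, data))


def demo():
    mailboxes = [Mailbox(), Mailbox()]  # processes named by linear rank 0 and 1

    def sender():
        # Rank 0 sends two tagged messages to rank 1.
        mailboxes[1].deliver(source=0, tag=7, data=[1, 2, 3])
        mailboxes[1].deliver(source=0, tag=9, data=[4, 5])

    worker = threading.Thread(target=sender)
    worker.start()
    # Rank 1 asks for tag 9 first; the tag-7 message is held until requested.
    second = mailboxes[1].receive(source=0, tag=9)
    first = mailboxes[1].receive(source=0, tag=7)
    worker.join()
    return first, second
```

Calling demo() shows that the matching rule, not arrival order, decides which message a receive returns.&lt;br /&gt;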
&lt;br /&gt;
&lt;br /&gt;
== Typical Structure ==&lt;br /&gt;
&lt;br /&gt;
Early message passing machines provided hardware primitives that were very close to the simple send/receive user-level communication abstraction, with some additional restrictions. A node was connected to a fixed set of neighbors in a regular pattern by point-to-point links that behaved as simple FIFOs. Most early machines were hypercubes, where each node is connected to n other nodes differing by one bit in the binary address, for a total of 2^n nodes, or meshes, where the nodes are connected to neighbors in two or three dimensions. The network topology was especially important in the early message passing machines, because only the neighboring processors could be named in a send or receive operation. The data transfer involved the sender writing the message into a link, which it could not finish doing until the receiver started reading it, so the send would block until the receive occurred. In modern terms this is called synchronous message passing because the two events coincide in time. The details of moving data were hidden from the programmer in a message passing library, forming a layer of software between send and receive calls and the actual hardware.&lt;br /&gt;
&lt;br /&gt;
[[Image:Hypercube.jpg]] &amp;lt;br&amp;gt;&lt;br /&gt;
Typical structure of an early message passing machine&lt;br /&gt;
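&lt;br /&gt;
The hypercube addressing rule above (neighbors differ by one bit in the binary address) can be sketched as follows. The function names are illustrative assumptions, not taken from any particular machine's library; the hop count relies on the standard observation that a shortest path in a hypercube corrects one differing address bit per hop.&lt;br /&gt;

```python
def hypercube_neighbors(node, n):
    """Return the n neighbors of `node` in an n-dimensional hypercube."""
    if node not in range(2 ** n):
        raise ValueError("node address out of range")
    # Flipping bit i of the address gives the neighbor along dimension i.
    return [node ^ (2 ** i) for i in range(n)]


def hop_count(src, dst):
    """Minimum number of links between two nodes (differing address bits)."""
    return bin(src ^ dst).count("1")
```

For example, in a 3-dimensional hypercube of 2^3 = 8 nodes, node 5 (binary 101) has neighbors 4, 7, and 1, and a message from node 0 to node 7 needs at least 3 hops.&lt;br /&gt;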
&lt;br /&gt;
The direct FIFO design was soon replaced by more versatile and robust designs that provided direct memory access (DMA) transfers on either end of the communication event. The use of DMA allowed non-blocking sends, where the sender is able to initiate a send and continue with useful computation (or even perform a receive) while the send completes. On the receiving end, the transfer is accepted via a DMA transfer by the message layer into a buffer and queued until the target process performs a matching receive, at which point the data is copied into the address space of the receiving process. The physical topology of the communication network dominated the programming model of these early machines, and parallel algorithms were often stated in terms of a specific interconnection topology, e.g., a ring, a grid, or a hypercube. However, to make the machines more generally useful, the designers of the message layers provided support for communication between arbitrary processors, rather than only between physical neighbors. This was originally supported by forwarding the data within the message layer along links in the network. Soon this routing function was moved into the hardware, so each node consisted of a processor with memory and a switch, called a router, that could forward messages. However, in this store-and-forward approach the time to transfer a message is proportional to the number of hops it takes through the network, so there remained an emphasis on interconnection topology.&lt;br /&gt;
&lt;br /&gt;
The emphasis on network topology was significantly reduced with the introduction of more general-purpose networks, which pipelined the message transfer through each of the routers forming the interconnection network. In most modern message passing machines, the incremental delay introduced by each router is small enough that the transfer time is dominated by the time to simply move the data between the processor and the network, not by how far it travels. This greatly simplifies the programming model; typically the processors are viewed as simply forming a linear sequence with uniform communication costs. A processor in a message passing machine can name only the locations in its local memory, and it can name each of the processors, perhaps by number or by route. A user process can only name private addresses and other processes; it can transfer data using the send/receive calls.&lt;br /&gt;
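&lt;br /&gt;
The difference between store-and-forward and pipelined transfer can be made concrete with a simplified cost model. The formulas below are a common textbook approximation and an assumption of this sketch, not measurements of any machine; the parameter names (size_bytes, bandwidth, hops, router_delay) are illustrative.&lt;br /&gt;

```python
def store_and_forward_time(size_bytes, bandwidth, hops, router_delay):
    # Each router receives the whole message before forwarding it,
    # so the full transmission time is paid once per hop.
    return hops * (size_bytes / bandwidth + router_delay)


def pipelined_time(size_bytes, bandwidth, hops, router_delay):
    # The message streams through the routers; the transmission time is
    # paid once and only the small per-router delay grows with distance.
    return size_bytes / bandwidth + hops * router_delay
```

With a 1000-byte message, a bandwidth of 100 bytes per time unit, 4 hops, and a router delay of 0.5 time units, the store-and-forward model gives 42 time units while the pipelined model gives 12, which is why distance matters far less in pipelined networks.&lt;br /&gt;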
&lt;br /&gt;
== Advantages ==&lt;br /&gt;
&lt;br /&gt;
'''Portability''': Message passing is implemented on most parallel platforms.&lt;br /&gt;
&lt;br /&gt;
'''Universality''': The model makes minimal assumptions about the underlying parallel hardware. Message-passing libraries exist for computers linked by networks and for shared- and distributed-memory multiprocessors.&lt;br /&gt;
&lt;br /&gt;
'''Simplicity''': The model supports explicit control of memory references, which makes debugging easier.&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=3.References&amp;diff=3115</id>
		<title>3.References</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=3.References&amp;diff=3115"/>
		<updated>2007-09-06T03:42:19Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://en.wikipedia.org/wiki/Blade_server] Wikipedia&lt;br /&gt;
&lt;br /&gt;
[http://www.compactpci-systems.com/columns/software_corner/pdfs/3.03.pdf] www.compactpci-systems.com&lt;br /&gt;
&lt;br /&gt;
[http://www.bladeserverscenter.com/i_technology.shtml] www.bladeserverscenter.com&lt;br /&gt;
&lt;br /&gt;
[http://www.blade.org/techover.cfm] www.blade.org&lt;br /&gt;
&lt;br /&gt;
[http://h41112.www4.hp.com/promo/blades-community/eur/en/library/articles/Both_worldspdf.pdf] www.hp.com&lt;br /&gt;
&lt;br /&gt;
[http://www.terian.com/terianprods.asp?s=Blades] www.terian.com&lt;br /&gt;
&lt;br /&gt;
[http://www.cisco.com/application/pdf/en/us/guest/netsol/ns500/c654/cdccont_0900aecd804ab4ce.pdf] www.cisco.com&lt;br /&gt;
&lt;br /&gt;
[http://www.hpcx.ac.uk/support/training/MPP.html] www.hpcx.ac.uk&lt;br /&gt;
&lt;br /&gt;
David E. Culler and Jaswinder Pal Singh, with Anoop Gupta,&lt;br /&gt;
Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999&lt;br /&gt;
&lt;br /&gt;
Neil Rasmussen and Suzanne Niles, Modular Systems: The Evolution of Reliability, White Paper #76&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Message_Passing2.jpg&amp;diff=3114</id>
		<title>File:Message Passing2.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Message_Passing2.jpg&amp;diff=3114"/>
		<updated>2007-09-06T03:41:11Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=1.Message_passing&amp;diff=3113</id>
		<title>1.Message passing</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=1.Message_passing&amp;diff=3113"/>
		<updated>2007-09-06T03:41:02Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Parallel programming requires interaction between the processes that run simultaneously on the individual processors, and this is enabled by passing messages between the processors. This important class of parallel machines, called message-passing architectures, employs complete computers as building blocks, including the microprocessor, memory, and I/O system, and provides communication between processors as explicit I/O operations. This style of architecture has much in common with networks of workstations, or clusters, except that the packaging of nodes is typically much tighter and the network is of much higher capability than a standard local area network.&lt;br /&gt;
&lt;br /&gt;
The world's largest supercomputers are used almost exclusively to run applications that are parallelised using message passing. The programming model is directly applicable to almost every parallel computer architecture.&lt;br /&gt;
&lt;br /&gt;
Parallel programming by definition involves co-operation between processes to solve a common task. The programmer has to define the tasks that will be executed by the processors, and also how these tasks are to synchronise and exchange data with one another. In the message-passing model the tasks are separate processes that communicate and synchronise by explicitly sending each other messages. All these parallel operations are performed via calls to some message-passing interface that is entirely responsible for interfacing with the physical communication network linking the actual processors together.&lt;br /&gt;
&lt;br /&gt;
[[Image:Message Passing2.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Message Passing ==&lt;br /&gt;
&lt;br /&gt;
In message passing, a substantial distance exists between the programming model and the actual hardware primitives, with user communication performed through operating system or library calls that perform the low-level actions, including the actual communication operation. The most common user-level communication operations in message passing are variants of send and receive. In its simplest form, send specifies a local data buffer that is to be transmitted and a receiving process (typically on a remote processor). Receive specifies a sending process and a local data buffer into which the transmitted data is to be placed. Together, a matching send and receive cause a data transfer from one processor to another. In most message passing systems, the send operation also allows an identifier or tag to be attached to the message, and the receive operation specifies a matching rule (such as a specific tag from a specific processor).&lt;br /&gt;
&lt;br /&gt;
[[Image:Message Passing.jpg]]&lt;br /&gt;
&lt;br /&gt;
The combination of a send and a matching receive accomplishes a memory-to-memory copy, where each end specifies its local data address, and a pairwise synchronization event. There are several possible variants of this synchronization event, depending upon whether the send completes when the receive has been executed, when the send buffer is available for reuse, or when the request has been accepted. Similarly, the receive can potentially wait until a matching send occurs or simply post the receive. Each of these variants has somewhat different semantics and different implementation requirements. Message passing has long been used as a means of communication and synchronization among arbitrary collections of cooperating sequential processes, even on a single processor. Important examples include programming languages, such as CSP and Occam, and common operating system functions, such as sockets. Parallel programs using message passing are typically quite structured, like their shared-memory counterparts. Most often, all nodes execute identical copies of a program, with the same code and private variables. Usually, processes can name each other using a simple linear ordering of the processes comprising a program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Typical Structure ==&lt;br /&gt;
&lt;br /&gt;
Early message passing machines provided hardware primitives that were very close to the simple send/receive user-level communication abstraction, with some additional restrictions. A node was connected to a fixed set of neighbors in a regular pattern by point-to-point links that behaved as simple FIFOs. Most early machines were hypercubes, where each node is connected to n other nodes differing by one bit in the binary address, for a total of 2^n nodes, or meshes, where the nodes are connected to neighbors in two or three dimensions. The network topology was especially important in the early message passing machines, because only the neighboring processors could be named in a send or receive operation. The data transfer involved the sender writing the message into a link, which it could not finish doing until the receiver started reading it, so the send would block until the receive occurred. In modern terms this is called synchronous message passing because the two events coincide in time. The details of moving data were hidden from the programmer in a message passing library, forming a layer of software between send and receive calls and the actual hardware.&lt;br /&gt;
&lt;br /&gt;
[[Image:Hypercube.jpg]] &amp;lt;br&amp;gt;&lt;br /&gt;
Typical structure of an early message passing machine&lt;br /&gt;
&lt;br /&gt;
The direct FIFO design was soon replaced by more versatile and robust designs that provided direct memory access (DMA) transfers on either end of the communication event. The use of DMA allowed non-blocking sends, where the sender is able to initiate a send and continue with useful computation (or even perform a receive) while the send completes. On the receiving end, the transfer is accepted via a DMA transfer by the message layer into a buffer and queued until the target process performs a matching receive, at which point the data is copied into the address space of the receiving process. The physical topology of the communication network dominated the programming model of these early machines, and parallel algorithms were often stated in terms of a specific interconnection topology, e.g., a ring, a grid, or a hypercube. However, to make the machines more generally useful, the designers of the message layers provided support for communication between arbitrary processors, rather than only between physical neighbors. This was originally supported by forwarding the data within the message layer along links in the network. Soon this routing function was moved into the hardware, so each node consisted of a processor with memory and a switch, called a router, that could forward messages. However, in this store-and-forward approach the time to transfer a message is proportional to the number of hops it takes through the network, so there remained an emphasis on interconnection topology.&lt;br /&gt;
&lt;br /&gt;
The emphasis on network topology was significantly reduced with the introduction of more general-purpose networks, which pipelined the message transfer through each of the routers forming the interconnection network. In most modern message passing machines, the incremental delay introduced by each router is small enough that the transfer time is dominated by the time to simply move the data between the processor and the network, not by how far it travels. This greatly simplifies the programming model; typically the processors are viewed as simply forming a linear sequence with uniform communication costs. A processor in a message passing machine can name only the locations in its local memory, and it can name each of the processors, perhaps by number or by route. A user process can only name private addresses and other processes; it can transfer data using the send/receive calls.&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=1.Message_passing&amp;diff=3111</id>
		<title>1.Message passing</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=1.Message_passing&amp;diff=3111"/>
		<updated>2007-09-06T03:40:26Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Parallel programming requires interaction between the processes that run simultaneously on the individual processors, and this is enabled by passing messages between the processors. This important class of parallel machines, called message-passing architectures, employs complete computers as building blocks, including the microprocessor, memory, and I/O system, and provides communication between processors as explicit I/O operations. This style of architecture has much in common with networks of workstations, or clusters, except that the packaging of nodes is typically much tighter and the network is of much higher capability than a standard local area network.&lt;br /&gt;
&lt;br /&gt;
The world's largest supercomputers are used almost exclusively to run applications that are parallelised using message passing. The programming model is directly applicable to almost every parallel computer architecture.&lt;br /&gt;
&lt;br /&gt;
Parallel programming by definition involves co-operation between processes to solve a common task. The programmer has to define the tasks that will be executed by the processors, and also how these tasks are to synchronise and exchange data with one another. In the message-passing model the tasks are separate processes that communicate and synchronise by explicitly sending each other messages. All these parallel operations are performed via calls to some message-passing interface that is entirely responsible for interfacing with the physical communication network linking the actual processors together.&lt;br /&gt;
&lt;br /&gt;
[[Image:Message Passing.jpg]]&lt;br /&gt;
&lt;br /&gt;
== Message Passing ==&lt;br /&gt;
&lt;br /&gt;
In message passing, a substantial distance exists between the programming model and the actual hardware primitives, with user communication performed through operating system or library calls that perform the low-level actions, including the actual communication operation. The most common user-level communication operations in message passing are variants of send and receive. In its simplest form, send specifies a local data buffer that is to be transmitted and a receiving process (typically on a remote processor). Receive specifies a sending process and a local data buffer into which the transmitted data is to be placed. Together, a matching send and receive cause a data transfer from one processor to another. In most message passing systems, the send operation also allows an identifier or tag to be attached to the message, and the receive operation specifies a matching rule (such as a specific tag from a specific processor).&lt;br /&gt;
&lt;br /&gt;
[[Image:Message Passing.jpg]]&lt;br /&gt;
&lt;br /&gt;
The combination of a send and a matching receive accomplishes a memory-to-memory copy, where each end specifies its local data address, and a pairwise synchronization event. There are several possible variants of this synchronization event, depending upon whether the send completes when the receive has been executed, when the send buffer is available for reuse, or when the request has been accepted. Similarly, the receive can potentially wait until a matching send occurs or simply post the receive. Each of these variants has somewhat different semantics and different implementation requirements. Message passing has long been used as a means of communication and synchronization among arbitrary collections of cooperating sequential processes, even on a single processor. Important examples include programming languages, such as CSP and Occam, and common operating system functions, such as sockets. Parallel programs using message passing are typically quite structured, like their shared-memory counterparts. Most often, all nodes execute identical copies of a program, with the same code and private variables. Usually, processes can name each other using a simple linear ordering of the processes comprising a program.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Typical Structure ==&lt;br /&gt;
&lt;br /&gt;
Early message passing machines provided hardware primitives that were very close to the simple send/receive user-level communication abstraction, with some additional restrictions. A node was connected to a fixed set of neighbors in a regular pattern by point-to-point links that behaved as simple FIFOs. Most early machines were hypercubes, where each node is connected to n other nodes differing by one bit in the binary address, for a total of 2^n nodes, or meshes, where the nodes are connected to neighbors in two or three dimensions. The network topology was especially important in the early message passing machines, because only the neighboring processors could be named in a send or receive operation. The data transfer involved the sender writing the message into a link, which it could not finish doing until the receiver started reading it, so the send would block until the receive occurred. In modern terms this is called synchronous message passing because the two events coincide in time. The details of moving data were hidden from the programmer in a message passing library, forming a layer of software between send and receive calls and the actual hardware.&lt;br /&gt;
&lt;br /&gt;
[[Image:Hypercube.jpg]] &amp;lt;br&amp;gt;&lt;br /&gt;
Typical structure of an early message passing machine&lt;br /&gt;
&lt;br /&gt;
The direct FIFO design was soon replaced by more versatile and robust designs that provided direct memory access (DMA) transfers on either end of the communication event. The use of DMA allowed non-blocking sends, where the sender is able to initiate a send and continue with useful computation (or even perform a receive) while the send completes. On the receiving end, the transfer is accepted via a DMA transfer by the message layer into a buffer and queued until the target process performs a matching receive, at which point the data is copied into the address space of the receiving process. The physical topology of the communication network dominated the programming model of these early machines, and parallel algorithms were often stated in terms of a specific interconnection topology, e.g., a ring, a grid, or a hypercube. However, to make the machines more generally useful, the designers of the message layers provided support for communication between arbitrary processors, rather than only between physical neighbors. This was originally supported by forwarding the data within the message layer along links in the network. Soon this routing function was moved into the hardware, so each node consisted of a processor with memory and a switch, called a router, that could forward messages. However, in this store-and-forward approach the time to transfer a message is proportional to the number of hops it takes through the network, so there remained an emphasis on interconnection topology.&lt;br /&gt;
&lt;br /&gt;
The emphasis on network topology was significantly reduced with the introduction of more general-purpose networks, which pipelined the message transfer through each of the routers forming the interconnection network. In most modern message passing machines, the incremental delay introduced by each router is small enough that the transfer time is dominated by the time to simply move the data between the processor and the network, not by how far it travels. This greatly simplifies the programming model; typically the processors are viewed as simply forming a linear sequence with uniform communication costs. A processor in a message passing machine can name only the locations in its local memory, and it can name each of the processors, perhaps by number or by route. A user process can only name private addresses and other processes; it can transfer data using the send/receive calls.&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=2.Blade_Servers&amp;diff=3110</id>
		<title>2.Blade Servers</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=2.Blade_Servers&amp;diff=3110"/>
		<updated>2007-09-06T03:35:09Z</updated>

		<summary type="html">&lt;p&gt;Kperi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
Blade servers are a revolutionary new concept for enterprise applications currently using a “stack of PC servers” approach. Blade servers promise to greatly increase compute density, reduce cost, improve reliability, and simplify cabling. Companies such as Dell, Hewlett Packard, IBM, RLX, and Sun offer blade server solutions that reduce operating expense while increasing services density. Blade servers form the basis for a modular computing paradigm.&lt;br /&gt;
&lt;br /&gt;
== Evolution ==&lt;br /&gt;
For many years, traditional standalone servers grew larger and faster, taking on more and more tasks as networked computing expanded. New servers were added to data centers as the need arose, often as a quick fix with little coordination or planning; it was not unusual for data center operators to discover that servers had been added without their knowledge. The resulting complexity of boxes and cabling became a growing invitation to confusion, mistakes, and inflexibility.&lt;br /&gt;
&lt;br /&gt;
                                                [[Image:Conventional Servers.jpg]]&lt;br /&gt;
                                                    Figure: Conventional Servers&lt;br /&gt;
Blade servers, first appearing in 2001, are a very simple and pure example of modular architecture: the blades in a blade server chassis are physically identical, with identical processors, ready to be configured and used for any purpose desired by the user. Their introduction brought many benefits of modularity to the server landscape: scalability, ease of duplication, specialization of function, and adaptability. Blade servers were developed in response to a critical and growing need in the data center: the requirement to increase server performance and availability without dramatically increasing the size, cost, and management complexity of an ever-growing data center. To keep up with user demand, and because of the space and power demands of traditional tower and rackmount servers, data centers are being forced to expand their physical plant at an alarming rate.&lt;br /&gt;
&lt;br /&gt;
                          [[Image:Blade Server.jpg]]&lt;br /&gt;
&lt;br /&gt;
But while these classic modular advantages have given blade servers a growing presence in data centers, their full potential awaits the widespread implementation of one remaining critical capability of modular design: fault tolerance. Fault tolerant blade servers – ones with built-in “failover” logic to transfer operation from failed to healthy blades – have only recently started to become available and affordable. The reliability of such fault tolerant servers will surpass that of current techniques involving redundant software and clusters of single servers, putting blade servers in a position to become the dominant server architecture of data centers. With the emergence of automated fault tolerance, industry observers predict rapid migration to blade servers over the forthcoming years.&lt;br /&gt;
&lt;br /&gt;
The Terian EdgeXPS® 714-132 is powered by the Dual-Core Intel® Xeon® 5100 Series processor and can support up to 14 dual-processor blade servers (28 processors in total) in a single 7U chassis. &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
                                        [[Image:Terian EdgeXPS® 714-132 .jpg]]&lt;br /&gt;
&lt;br /&gt;
== General blade server architecture ==&lt;br /&gt;
&lt;br /&gt;
A general blade server architecture is shown in the figure. The hardware components of a blade server are the switch blade, the chassis (with fans, temperature sensors, etc.), and multiple compute blades. Some vendors offer, or partner (or plan to partner) with companies that offer, application-specific blades that perform traffic&lt;br /&gt;
conditioning, protection, or network processing before the traffic reaches the compute blades. These application-specific&lt;br /&gt;
blades are often functionally positioned between the switch blade and the compute blades, even though they physically reside in a standard&lt;br /&gt;
compute blade slot.&lt;br /&gt;
&lt;br /&gt;
                                  [[Image:Figure 1.jpg]]&lt;br /&gt;
&lt;br /&gt;
The outside world connects through the rear of the chassis to a switch card in the blade server. The switch card is provisioned to&lt;br /&gt;
distribute packets to the blades within the blade server. All these components are tied together by network management system&lt;br /&gt;
software provided by the blade server vendor. The specifics of the blade server architecture vary from vendor to vendor. But before&lt;br /&gt;
you discount this as a bunch of proprietary architectures, think again. Remember that IBM and others dramatically advanced and&lt;br /&gt;
proliferated the PC architecture, changing the face of computing forever. &lt;br /&gt;
 &lt;br /&gt;
The blade server industry appears to be headed in the same direction. There are some areas where standardization of blade&lt;br /&gt;
server components will prove helpful. However, blade server vendors' ability to quickly adapt and advance their architectures to&lt;br /&gt;
suit specific applications, unencumbered by the standards process, is likely to accelerate proliferation in the near term.&lt;br /&gt;
&lt;br /&gt;
== Blade Enclosure ==&lt;br /&gt;
&lt;br /&gt;
The enclosure (or chassis) performs many of the non-core computing services found in most computers. Non-blade computers need components that are bulky, hot, and space-inefficient, and these components are duplicated across many computers that may or may not be running at capacity. Locating these services in one place and sharing them among the blade computers makes overall utilization more efficient. Which services are provided, and how, varies by vendor.&lt;br /&gt;
&lt;br /&gt;
'''Power'''&lt;br /&gt;
&lt;br /&gt;
Computers operate over a range of DC voltages, yet power is delivered from utilities as AC, and at higher voltages than required within the computer. Converting this current requires power supply units (or PSUs). To ensure that the failure of one power source does not affect the operation of the computer, even entry-level servers have redundant power supplies, again adding to the bulk and heat output of the design.&lt;br /&gt;
&lt;br /&gt;
The blade enclosure's power supply provides a single power source for all blades within the enclosure. This single power source may be in the form of a power supply in the enclosure or a dedicated separate PSU supplying DC to multiple enclosures [1]. This setup not only reduces the number of PSUs required to provide a resilient power supply, but it also improves efficiency because it reduces the number of idle PSUs. In the event of a PSU failure the blade chassis throttles down individual blade server performance until it matches the available power. This is carried out in steps of 12.5% per CPU until power balance is achieved.&lt;br /&gt;
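&lt;br /&gt;
The step-wise throttling described above can be sketched as follows. This is only an illustration, not any vendor's firmware: the function name, the wattage figures, and the assumption that a CPU's power draw scales linearly with its performance level are all hypothetical. Only the 12.5% step size comes from the text.&lt;br /&gt;
&lt;br /&gt;
```python
STEP = 0.125  # throttle step per CPU, from the text (12.5%)


def throttle_to_budget(cpu_full_watts, budget_watts):
    """Lower every CPU by one 12.5% step at a time until demand fits the budget.

    Assumption (not from the source): power draw is proportional to the
    performance level. Returns the (level, demand in watts) that was reached.
    """
    full_demand = sum(cpu_full_watts)
    level = 1.0
    demand = full_demand
    # Stop once demand fits the budget or the CPUs cannot be slowed further.
    while demand > budget_watts and level > 0.0:
        level = max(level - STEP, 0.0)
        demand = level * full_demand
    return level, demand


# Hypothetical chassis: 8 CPUs at 100 W each; after a PSU failure only
# 500 W remain available.
level, demand = throttle_to_budget([100.0] * 8, budget_watts=500.0)
print(level, demand)  # prints: 0.625 500.0
```
&lt;br /&gt;
In this hypothetical case the chassis settles after three steps, with every CPU running at 62.5% of full performance.&lt;br /&gt;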
&lt;br /&gt;
'''Cooling'''&lt;br /&gt;
&lt;br /&gt;
During operation, electrical and mechanical components produce heat, which must be dissipated to ensure that the components function properly. In blade enclosures, as in most computing systems, heat is removed with fans.&lt;br /&gt;
&lt;br /&gt;
A frequently underestimated problem when designing high-performance computer systems is the conflict between the amount of heat a system generates and the ability of its fans to remove that heat. The blade's shared power and cooling mean that it does not generate as much heat as traditional servers. Newer blade enclosure designs feature high-speed, adjustable fans and control logic that tune the cooling to the system's requirements [2].&lt;br /&gt;
&lt;br /&gt;
At the same time, the increased density of blade server configurations can still result in higher overall demands for cooling when a rack is populated at over 50%. This is especially true with early generation blades. In absolute terms, a fully populated rack of blade servers is likely to require more cooling capacity than a fully populated rack of standard 1U servers.&lt;br /&gt;
&lt;br /&gt;
'''Networking'''&lt;br /&gt;
&lt;br /&gt;
Computers are increasingly being produced with high-speed, integrated network interfaces, and most are expandable to allow for the addition of connections that are faster, more resilient, and run over different media (copper and fiber). These may require extra engineering effort in the design and manufacture of the blade, consume space both for installed interfaces and for the capacity to install more (empty expansion slots), and hence add complexity. High-speed network topologies require expensive, high-speed integrated circuits and media, while most computers do not use all the bandwidth available.&lt;br /&gt;
&lt;br /&gt;
The blade enclosure provides one or more network buses to which the blade will connect, and either presents these ports individually in a single location (versus one in each computer chassis), or aggregates them into fewer ports, reducing the cost of connecting the individual devices. These may be presented in the chassis itself, or in networking blades[3].&lt;br /&gt;
&lt;br /&gt;
'''Storage'''&lt;br /&gt;
&lt;br /&gt;
While computers typically need hard disks to store the operating system, applications, and data, these disks are not necessarily required locally. Many storage connection methods (e.g. FireWire, SATA, SCSI, DAS, Fibre Channel and iSCSI) are readily moved outside the server, though not all are used in enterprise-level installations. Implementing these connection interfaces within the computer presents challenges similar to those of the networking interfaces (indeed, iSCSI runs over the network interface), and similarly they can be removed from the blade and presented individually or aggregated, either on the chassis or through other blades.&lt;br /&gt;
&lt;br /&gt;
The ability to boot the blade from a storage area network (SAN) allows for an entirely disk-free blade. Such a blade may offer higher processor density or better reliability than systems that have individual disks on each blade.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Advantages of Blade Servers ==&lt;br /&gt;
&lt;br /&gt;
     '''Reduced Space Requirements''' - greater density provides a 35 to 45 percent improvement in space use compared to tower or rack-mounted &amp;lt;br&amp;gt;servers.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
     '''Reduced Power Consumption and Improved Power Management''' - consolidating power supplies into the blade chassis reduces the number&amp;lt;br&amp;gt; of separate power supplies needed and reduces the power requirements per server.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
     '''Lower Management Cost''' - server consolidation and resource centralization simplify server deployment, management and &amp;lt;br&amp;gt;administration, and improve control.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
     '''Simplified Cabling''' - rack-mounted servers, while helping consolidate servers into a centralized location, create wiring &amp;lt;br&amp;gt;proliferation. Blade servers simplify cabling requirements and reduce wiring by up to 70 percent. Power cabling, operator wiring &amp;lt;br&amp;gt;(keyboard, mouse, etc.) and communications cabling (Ethernet, SAN connections, cluster connection) are greatly reduced.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
     '''Future Proofing Through Modularity''' - as new processor, communications, storage and interconnect technology becomes available, it &amp;lt;br&amp;gt;can be implemented in blades that install into existing equipment, upgrading server operation at a minimum cost and with no &amp;lt;br&amp;gt;disruption of basic server functionality.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
     '''Easier Physical Deployment''' - once a blade server chassis has been installed, adding servers is merely a matter of &amp;lt;br&amp;gt;sliding additional blades into the chassis. Software management tools simplify the management and reporting functions for blade &amp;lt;br&amp;gt;servers. Redundant power modules and consolidated communication bays simplify integration into data centers and increase &amp;lt;br&amp;gt;reliability.&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
== Are blade servers an extension of message passing? ==&lt;br /&gt;
&lt;br /&gt;
Blade servers use message passing to achieve fast and efficient performance. Parallel computing frequently relies on message passing to exchange information between computational units. In high-performance computing, the most common message passing technology is the '''Message Passing Interface (MPI)''', a standard for which an open-source implementation is being developed with support from Cisco Systems® and other vendors.&lt;br /&gt;
&lt;br /&gt;
High-performance computing (HPC) cluster applications require a high-performance interconnect between blade servers to run computation-intensive workloads quickly and efficiently. When messages are passed between nodes, some time is spent transmitting them, and depending on how often data is synchronized between processes, that time can have a significant effect on total application run time. It is critically important to understand the application's interprocess communication patterns and the frequency of updates, because these affect the performance and design of the parallel application, the design of the interconnecting network, and the choice of network technology.&lt;br /&gt;
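&lt;br /&gt;
The effect of synchronization frequency on run time can be estimated with a simple first-order model. The sketch below is an assumption introduced for illustration, not something taken from the source: each synchronization is treated as one message costing a fixed latency plus its size divided by the link bandwidth, with no overlap between computation and communication. The function name and all figures are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def run_time(compute_s, n_syncs, msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate total run time as compute time plus communication time.

    Assumption: every synchronization sends one message that costs
    latency + size / bandwidth, and nothing overlaps with computation.
    """
    per_message_s = latency_s + msg_bytes / bandwidth_bytes_per_s
    return compute_s + n_syncs * per_message_s


# Hypothetical job: 10 s of pure computation, 1 MB exchanged per sync.
# The two interconnects below use illustrative figures, not measurements.
for n_syncs in (10, 1000):
    slow = run_time(10.0, n_syncs, 1e6, latency_s=50e-6, bandwidth_bytes_per_s=125e6)
    fast = run_time(10.0, n_syncs, 1e6, latency_s=2e-6, bandwidth_bytes_per_s=1e9)
    print(n_syncs, round(slow, 3), round(fast, 3))
```
&lt;br /&gt;
With few synchronizations the two hypothetical interconnects give nearly the same run time; with many, communication cost dominates the slower one. This is why the communication pattern has to be understood before the network technology is chosen.&lt;br /&gt;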
&lt;br /&gt;
With traditional transport protocols such as TCP/IP, the CPU is responsible for managing how data is moved between I/O and memory, and&lt;br /&gt;
for transport protocol processing. As a result, time spent communicating between nodes is time not spent processing the application. Minimizing communication time is therefore a key consideration for certain classes of applications.&lt;br /&gt;
&lt;br /&gt;
MPI is “middleware” that sits between the application and the network hardware. It provides a portable mechanism for exchanging messages between processes regardless of the underlying network or parallel computational environment. Implementations of the MPI standard therefore use underlying communications stacks such as TCP or UDP over IP, InfiniBand, or Myrinet to communicate between processes. MPI offers a rich set of functions that can be combined in simple or complex ways to express a wide range of parallel computations. The ability to exchange messages lets instructions or data be passed between nodes, for example to distribute data sets for calculation. MPI has been implemented on a wide variety of platforms, operating systems, and cluster and supercomputer architectures.&lt;br /&gt;
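&lt;br /&gt;
The send and receive style of programming that MPI standardizes can be illustrated without an MPI installation. The sketch below is not MPI: it uses Python threads and queue.Queue mailboxes as stand-ins for processes and the interconnect, and the names worker, scatter_and_reduce, and inboxes are hypothetical. It shows rank 0 distributing slices of a data set, each worker computing a partial sum, and rank 0 collecting the results.&lt;br /&gt;
&lt;br /&gt;
```python
import queue
import threading


def worker(rank, inboxes):
    """Receive a chunk from rank 0, compute a partial sum, send it back."""
    chunk = inboxes[rank].get()          # blocking receive
    inboxes[0].put((rank, sum(chunk)))   # send the result to rank 0


def scatter_and_reduce(data, n_workers):
    """Rank 0: send one slice to each worker, then add up their replies."""
    # One mailbox per rank; rank 0 acts as the coordinator.
    inboxes = [queue.Queue() for _ in range(n_workers + 1)]
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes))
        for rank in range(1, n_workers + 1)
    ]
    for t in threads:
        t.start()
    for rank in range(1, n_workers + 1):
        # Strided slice, so every element goes to exactly one worker.
        inboxes[rank].put(data[rank - 1::n_workers])
    total = sum(inboxes[0].get()[1] for _ in range(n_workers))
    for t in threads:
        t.join()
    return total


print(scatter_and_reduce(list(range(100)), n_workers=4))  # prints: 4950
```
&lt;br /&gt;
In a real MPI program the same pattern would typically be written with point-to-point send and receive calls or with scatter and reduce collectives, and the messages would travel over the cluster interconnect rather than through in-memory queues.&lt;br /&gt;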
&lt;br /&gt;
See also [http://h41112.www4.hp.com/promo/blades-community/eur/en/library/articles/Both_worldspdf.pdf] '''The best of both worlds'''&lt;/div&gt;</summary>
		<author><name>Kperi</name></author>
	</entry>
</feed>