<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Nknichol</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Nknichol"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Nknichol"/>
	<updated>2026-06-03T16:11:02Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62637</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62637"/>
		<updated>2012-04-23T18:03:17Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* mordred2 (Kerlabs) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Motivation For Article==&lt;br /&gt;
This article briefly outlines the architecture of current large-scale multiprocessor systems. It details several current systems, including their manufacturer, physical composition and connecting network, memory consistency models, and how data is kept coherent within the system.&lt;br /&gt;
&lt;br /&gt;
==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and mordred2 from Kerlabs.&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes in addition to 6 I/O nodes, and each node contains one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;, which has a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is considered the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus.&amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;&lt;br /&gt;
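&lt;br /&gt;
To make the idea of a torus concrete, the following minimal Python sketch lists the direct neighbors of one coordinate in a torus, where each axis wraps around at its edges. The function name torus_neighbors and the 4 x 4 x 4 dimensions are hypothetical choices for illustration only; they are not the actual Tofu dimensions or any Fujitsu API.&lt;br /&gt;

```python
def torus_neighbors(coord, dims):
    """Return the direct neighbors of coord in a torus with the given axis sizes.

    coord and dims are tuples of equal length. Each axis wraps around,
    so the node at the last position of an axis neighbors the first one.
    """
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            moved = list(coord)
            # Modular arithmetic gives the wraparound link of a torus.
            moved[axis] = (coord[axis] + step) % size
            candidate = tuple(moved)
            # Skip self-links (axis of size 1) and duplicates (axis of size 2).
            if candidate != coord and candidate not in neighbors:
                neighbors.append(candidate)
    return neighbors


if __name__ == "__main__":
    # Hypothetical 4 x 4 x 4 torus, chosen only for illustration.
    print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

In a three-dimensional torus each node has at most six direct links, which is why the number of links grows linearly with the number of nodes rather than quadratically.&lt;br /&gt;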
&lt;br /&gt;
The K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, is capable of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia general-purpose GPUs (GPGPUs). In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each with 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades. The individual nodes are connected using a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
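&lt;br /&gt;
The totals quoted above follow from the per-cabinet layout. The short Python sketch below reproduces that arithmetic; the constant and function names are hypothetical and chosen only for this illustration, while the figures themselves come from the preceding paragraphs.&lt;br /&gt;

```python
# Figures quoted in the text above; the names are illustrative only.
COMPUTE_CABINETS = 112
RACKS_PER_CABINET = 4
BLADES_PER_RACK = 8
NODES_PER_BLADE = 2
XEONS_PER_NODE = 2
GPUS_PER_NODE = 1


def tianhe_totals():
    """Return the blade, node, Xeon, and GPU totals implied by the layout."""
    blades = COMPUTE_CABINETS * RACKS_PER_CABINET * BLADES_PER_RACK
    nodes = blades * NODES_PER_BLADE
    return {
        "blades": blades,
        "nodes": nodes,
        "xeons": nodes * XEONS_PER_NODE,
        "gpus": nodes * GPUS_PER_NODE,
    }


if __name__ == "__main__":
    for name, count in tianhe_totals().items():
        print(name, count)
```

The computed Xeon and GPU counts agree with the 14,336 processors and 7,168 GPUs stated earlier in this section.&lt;br /&gt;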
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system also uses message passing (MPI), so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&amp;lt;ref name=&amp;quot;tianhe_mpi&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This cluster computer cost $88 million to build and an additional $20 million per year for electricity and operating expenses.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system running the open-source software Kerrighed. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, making it the largest known cluster running Kerrighed.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; Its distributed shared memory is provided at the software level by Kerrighed, a Linux extension. This extension allows the programmer to treat remote memory accesses as shared memory accesses, effectively abstracting away the message passing required by the underlying cluster hardware. The software provides sequential consistency, process migration to another node, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Tianhe-1&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe_mpi&amp;quot;&amp;gt;http://software.intel.com/file/39450&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62633</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62633"/>
		<updated>2012-04-23T17:57:08Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Large-Scale Multiprocessor Examples */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Motivation For Article==&lt;br /&gt;
This article briefly outlines the architecture of current large-scale multiprocessor systems. It details several current systems, including their manufacturer, physical composition and connecting network, memory consistency models, and how data is kept coherent within the system.&lt;br /&gt;
&lt;br /&gt;
==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and mordred2 from Kerlabs.&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes in addition to 6 I/O nodes, and each node contains one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;, which has a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is considered the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus.&amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, is capable of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia general-purpose GPUs (GPGPUs). In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each with 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades. The individual nodes are connected using a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system also uses message passing (MPI), so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&amp;lt;ref name=&amp;quot;tianhe_mpi&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This cluster computer cost $88 million to build and an additional $20 million per year for electricity and operating expenses.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system running the open-source software Kerrighed. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, making it the largest known cluster running Kerrighed.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; Its distributed shared memory is provided at the software level by Kerrighed, a Linux extension. The software provides sequential consistency, process migration to another node, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Tianhe-1&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe_mpi&amp;quot;&amp;gt;http://software.intel.com/file/39450&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62632</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62632"/>
		<updated>2012-04-23T17:56:01Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* K Computer */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Motivation For Article==&lt;br /&gt;
This article briefly outlines the architecture of current large-scale multiprocessor systems. It details several current systems, including their manufacturer, physical composition and connecting network, memory consistency models, and how data is kept coherent within the system.&lt;br /&gt;
&lt;br /&gt;
==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and mordred2 from Kerlabs. Overall, in large cluster computers, message passing is much more popular than distributed shared memory.&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes in addition to 6 I/O nodes, and each node contains one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;, which has a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is considered the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus.&amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, is capable of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia general-purpose GPUs (GPGPUs). In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each with 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades. The individual nodes are connected using a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system also uses message passing (MPI), so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&amp;lt;ref name=&amp;quot;tianhe_mpi&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This cluster computer cost $88 million to build and an additional $20 million per year for electricity and operating expenses.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system running the open-source software Kerrighed. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, making it the largest known cluster running Kerrighed.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; Its distributed shared memory is provided at the software level by Kerrighed, a Linux extension. The software provides sequential consistency, process migration to another node, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Tianhe-1&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe_mpi&amp;quot;&amp;gt;http://software.intel.com/file/39450&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62631</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62631"/>
		<updated>2012-04-23T17:55:21Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Tianhe-1A */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Motivation For Article==&lt;br /&gt;
This article briefly outlines the architecture of current large-scale multiprocessor systems. It details several current systems, including their manufacturer, physical composition and connecting network, memory consistency models, and how data is kept coherent within the system.&lt;br /&gt;
&lt;br /&gt;
==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and mordred2 from Kerlabs. Overall, in large cluster computers, message passing is much more popular than distributed shared memory.&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes in addition to 6 I/O nodes, and each node contains one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;, which has a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is considered the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus.&amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, is capable of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia general-purpose GPUs (GPGPUs). In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each with 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades. The individual nodes are connected using a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system also uses message passing (MPI), so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&amp;lt;ref name=&amp;quot;tianhe_mpi&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This cluster computer cost $88 million to build and an additional $20 million per year for electricity and operating expenses.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system running the open-source software Kerrighed. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, making it the largest known cluster running Kerrighed.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; Its distributed shared memory is provided at the software level by Kerrighed, a Linux extension. The software provides sequential consistency, process migration to another node, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Tianhe-1&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe_mpi&amp;quot;&amp;gt;http://software.intel.com/file/39450&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62630</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62630"/>
		<updated>2012-04-23T17:55:02Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Tianhe-1A */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Motivation For Article==&lt;br /&gt;
This article briefly outlines the architecture of current large-scale multiprocessor systems. It details several current systems, including their manufacturer, physical composition and connecting network, memory consistency models, and how data is kept coherent within the system.&lt;br /&gt;
&lt;br /&gt;
==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and mordred2 from Kerlabs. Overall, in large cluster computers, message passing is much more popular than distributed shared memory.&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes in addition to 6 I/O nodes, and each node contains one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;, which has a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is considered the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus.&amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, is capable of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia general-purpose GPUs (GPGPUs). In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each with 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades. The individual nodes are connected using a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system also uses message passing, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&amp;lt;ref name=&amp;quot;tianhe_mpi&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This cluster computer cost $88 million to build and an additional $20 million per year for electricity and operating expenses.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system running the open-source software Kerrighed. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, making it the largest known cluster running Kerrighed.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; Its distributed shared memory is provided at the software level by Kerrighed, a Linux extension. The software provides sequential consistency, process migration to another node, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Tianhe-1&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe_mpi&amp;quot;&amp;gt;http://software.intel.com/file/39450&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62629</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62629"/>
		<updated>2012-04-23T17:54:31Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Motivation For Article==&lt;br /&gt;
This article briefly outlines the architecture of current large-scale multiprocessor systems. We go into detail about several current systems, including their manufacturer, physical composition and connecting network, memory consistency models, and how data is kept coherent within the system.&lt;br /&gt;
&lt;br /&gt;
==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and mordred2 from Kerlabs. Overall, in large cluster computers, message passing is much more popular than distributed shared memory.&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes, and each node contains one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;. It is a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is considered the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus.&amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, has a theoretical peak performance of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia general-purpose GPUs (GPGPUs). In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each with 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades (112 cabinets, 4 racks per cabinet, 8 blades per rack) and 7,168 nodes. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&amp;lt;ref name=&amp;quot;tianhe_mpi&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This cluster computer cost $88 million to build and an additional $20 million per year for electricity and operating expenses.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
The mordred2 is one of several clusters operated by Kerlabs. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, making it the largest known cluster running Kerrighed.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; It is a distributed shared memory system: the shared memory is provided in software by Kerrighed, an open-source extension to Linux. Kerrighed provides sequential consistency, migration of processes between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Tianhe-1&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe_mpi&amp;quot;&amp;gt;http://software.intel.com/file/39450&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62628</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62628"/>
		<updated>2012-04-23T17:54:11Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Tianhe-1A */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Motivation For Article==&lt;br /&gt;
This article briefly outlines the architecture of current large-scale multiprocessor systems. We go into detail about several current systems, including their manufacturer, physical composition and connecting network, memory consistency models, and how data is kept coherent within the system.&lt;br /&gt;
&lt;br /&gt;
==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and mordred2 from Kerlabs. Overall, in large cluster computers, message passing is much more popular than distributed shared memory.&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes, and each node contains one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;. It is a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is considered the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus.&amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, has a theoretical peak performance of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia general-purpose GPUs (GPGPUs). In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each with 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades (112 cabinets, 4 racks per cabinet, 8 blades per rack) and 7,168 nodes. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&amp;lt;ref name=&amp;quot;tianhe_mpi&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This cluster computer cost $88 million to build and an additional $20 million per year for electricity and operating expenses.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
The mordred2 is one of several clusters operated by Kerlabs. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, making it the largest known cluster running Kerrighed.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; It is a distributed shared memory system: the shared memory is provided in software by Kerrighed, an open-source extension to Linux. Kerrighed provides sequential consistency, migration of processes between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Tianhe-1&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62603</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62603"/>
		<updated>2012-04-23T17:10:13Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Motivation For Article==&lt;br /&gt;
This article briefly outlines the architecture of current large-scale multiprocessor systems. We go into detail about several current systems, including their manufacturer, physical composition and connecting network, memory consistency models, and how data is kept coherent within the system.&lt;br /&gt;
&lt;br /&gt;
==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and mordred2 from Kerlabs. Overall, in large cluster computers, message passing is much more popular than distributed shared memory.&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes, and each node contains one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;. It is a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is considered the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus.&amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, has a theoretical peak performance of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia general-purpose GPUs (GPGPUs). In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each with 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades (112 cabinets, 4 racks per cabinet, 8 blades per rack) and 7,168 nodes. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&lt;br /&gt;
&lt;br /&gt;
This cluster computer cost $88 million to build and an additional $20 million per year for electricity and operating expenses.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
The mordred2 is one of several clusters operated by Kerlabs. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, making it the largest known cluster running Kerrighed.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; It is a distributed shared memory system: the shared memory is provided in software by Kerrighed, an open-source extension to Linux. Kerrighed provides sequential consistency, migration of processes between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Tianhe-1&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62601</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62601"/>
		<updated>2012-04-23T17:01:12Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Write-up==&lt;br /&gt;
11b. Current large-scale multiprocessors.  Who sells them, how do you assemble them into systems, what consistency models do they use (briefly, don't redo 10b), do they maintain coherence across the whole system, &amp;amp; how?&lt;br /&gt;
&lt;br /&gt;
==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and mordred2 from Kerlabs. Overall, in large cluster computers, message passing is much more popular than distributed shared memory.&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes, and each node contains one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;. It is a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is considered the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus.&amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, has a theoretical peak performance of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia general-purpose GPUs (GPGPUs). In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each with 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades (112 cabinets, 4 racks per cabinet, 8 blades per rack) and 7,168 nodes. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&lt;br /&gt;
&lt;br /&gt;
This cluster computer cost $88 million to build and an additional $20 million per year for electricity and operating expenses.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
The mordred2 is one of several clusters operated by Kerlabs. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, making it the largest known cluster running Kerrighed.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; It is a distributed shared memory system: the shared memory is provided in software by Kerrighed, an open-source extension to Linux. Kerrighed provides sequential consistency, migration of processes between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Tianhe-1&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62305</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62305"/>
		<updated>2012-04-16T22:40:52Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Large-Scale Multiprocessor Examples */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and mordred2 from Kerlabs. Overall, in large cluster computers, message passing is much more popular than distributed shared memory.&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes, and each node contains one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;. It is a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is considered the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus.&amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, has a theoretical peak performance of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia general-purpose GPUs (GPGPUs). In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each with 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades (112 cabinets, 4 racks per cabinet, 8 blades per rack) and 7,168 nodes. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&lt;br /&gt;
&lt;br /&gt;
This cluster computer cost $88 million to build and an additional $20 million per year for electricity and operating expenses.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
The mordred2 is one of several clusters operated by Kerlabs. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, making it the largest known cluster running Kerrighed.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; It is a distributed shared memory system: the shared memory is provided in software by Kerrighed, an open-source extension to Linux. Kerrighed provides sequential consistency, migration of processes between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Tianhe-1&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62304</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62304"/>
		<updated>2012-04-16T22:37:44Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* mordred2 (Kerlabs) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and [another example or two] [[How 'bout IBM's large systems--Blue Gene, etc]].&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes, and each node contains one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;. It is a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is considered the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus.&amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
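&lt;br /&gt;
The difference from shared memory can be sketched in a few lines. In the toy example below each simulated node keeps its data private and learns a value from its neighbor only by receiving an explicit message. Threads and queues stand in for nodes and network links purely for illustration; a real program on such a machine would use an MPI library, and the names node and run_ring are hypothetical.&lt;br /&gt;

```python
import queue
import threading


def node(rank, inboxes, results):
    """One simulated node: private data, explicit send, blocking receive."""
    local_value = (rank + 1) * 10  # private to this node, never read by others
    right = (rank + 1) % len(inboxes)
    inboxes[right].put((rank, local_value))  # send to the right-hand neighbor
    sender, value = inboxes[rank].get()  # wait for the left-hand neighbor
    results[rank] = (sender, value)


def run_ring(node_count):
    """Run node_count simulated nodes that each pass one message around a ring."""
    inboxes = [queue.Queue() for _ in range(node_count)]
    results = [None] * node_count
    threads = [
        threading.Thread(target=node, args=(rank, inboxes, results))
        for rank in range(node_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_ring(4))
```

No node ever reads another node's variables directly; all sharing happens through the send and receive calls, which is the property that distinguishes message passing from a DSM machine.&lt;br /&gt;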
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, has a theoretical peak performance of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia GPGPUs. In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
The system occupies 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each containing 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
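&lt;br /&gt;
The blade, node, and processor totals follow directly from the per-cabinet figures given above. The short check below multiplies them out; the constant and function names are hypothetical, and only the numbers come from the text.&lt;br /&gt;

```python
COMPUTE_CABINETS = 112
RACKS_PER_CABINET = 4
BLADES_PER_RACK = 8
NODES_PER_BLADE = 2
XEONS_PER_NODE = 2
GPUS_PER_NODE = 1


def tianhe_totals():
    """Derive system-wide counts from the per-cabinet composition."""
    blades = COMPUTE_CABINETS * RACKS_PER_CABINET * BLADES_PER_RACK
    nodes = blades * NODES_PER_BLADE
    return {
        "blades": blades,
        "nodes": nodes,
        "xeons": nodes * XEONS_PER_NODE,
        "gpus": nodes * GPUS_PER_NODE,
    }


if __name__ == "__main__":
    for name, count in tianhe_totals().items():
        print(name, count)
```

The results agree with the 3,584 blades, 14,336 Xeon processors, and 7,168 GPUs quoted in this section.&lt;br /&gt;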
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&lt;br /&gt;
&lt;br /&gt;
This cluster computer cost $88 million to build and an additional $20 million per year for electricity and operating expenses.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system running the open-source software Kerrighed. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, making it the largest known cluster running Kerrighed.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; Its distributed shared memory is provided at the software level by Kerrighed, a Linux extension. The software provides sequential consistency, process migration between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Tianhe-1&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62303</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62303"/>
		<updated>2012-04-16T22:36:05Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Tianhe-1A */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and [another example or two] [[How 'bout IBM's large systems--Blue Gene, etc]].&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes (102 × 864 = 88,128); each compute node holds one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The nodes are networked via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;, which has a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus. &amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, has a theoretical peak performance of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia GPGPUs. In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
The system occupies 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each containing 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&lt;br /&gt;
&lt;br /&gt;
This cluster computer cost $88 million to build and an additional $20 million per year for electricity and operating expenses.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system running the open-source software Kerrighed. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; Its distributed shared memory is provided at the software level by Kerrighed, a Linux extension. The software provides sequential consistency, process migration between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Tianhe-1&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62302</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62302"/>
		<updated>2012-04-16T22:34:19Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and [another example or two] [[How 'bout IBM's large systems--Blue Gene, etc]].&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes (102 × 864 = 88,128); each compute node holds one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The nodes are networked via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;, which has a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus. &amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, has a theoretical peak performance of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia GPGPUs. In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
The system occupies 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each containing 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system running the open-source software Kerrighed. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; Its distributed shared memory is provided at the software level by Kerrighed, a Linux extension. The software provides sequential consistency, process migration between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;tianhe&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Tianhe-1&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62301</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62301"/>
		<updated>2012-04-16T22:34:01Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Tianhe-1A */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and [another example or two] [[How 'bout IBM's large systems--Blue Gene, etc]].&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes (102 × 864 = 88,128); each compute node holds one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The nodes are networked via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;, which has a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus. &amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, has a theoretical peak performance of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia GPGPUs. In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
The system occupies 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each containing 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&amp;lt;ref name=&amp;quot;tianhe&amp;quot; /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system running the open-source software Kerrighed. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; Its distributed shared memory is provided at the software level by Kerrighed, a Linux extension. The software provides sequential consistency, process migration between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62300</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62300"/>
		<updated>2012-04-16T22:32:05Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Tianhe-1A */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and [another example or two] [[How 'bout IBM's large systems--Blue Gene, etc]].&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes (102 × 864 = 88,128); each compute node holds one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The nodes are networked via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;, which has a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus. &amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, has a theoretical peak performance of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia GPGPUs. In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
The system occupies 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each containing 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system running the open-source software Kerrighed. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; Its distributed shared memory is provided at the software level by Kerrighed, a Linux extension. The software provides sequential consistency, process migration between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62187</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62187"/>
		<updated>2012-04-16T02:30:07Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* mordred2 (Kerlabs) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and [another example or two] [[How 'bout IBM's large systems--Blue Gene, etc]].&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes (102 × 864 = 88,128); each compute node holds one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The nodes are networked via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;, which has a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus. &amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, has a theoretical peak performance of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia GPGPUs. In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
The system occupies 112 compute cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each compute cabinet holds 4 racks, each containing 8 blades and a 16-port switch. A single blade contains 2 compute nodes, each with 2 Xeon processors and 1 Nvidia GPU. This comes to a total of 3,584 blades. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&lt;br /&gt;
&lt;br /&gt;
[[Maybe you could make a table of characteristics of these supercomputers ... you could use top500 as a starting point, and add more detailed info on architecture ... though that might be hard to obtain for some.]]&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system running the open-source software Kerrighed. The cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory.&amp;lt;ref name=&amp;quot;mordred2&amp;quot;/&amp;gt; Its distributed shared memory is provided at the software level by Kerrighed, a Linux extension. The software provides sequential consistency, process migration between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62186</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62186"/>
		<updated>2012-04-16T02:29:10Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and [another example or two] [[How 'bout IBM's large systems--Blue Gene, etc]].&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes (102 × 864 = 88,128); each compute node holds one processor and 16 GB of memory.&amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The nodes are networked via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;, which has a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus. &amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, is capable of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia GP-GPUs. In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 computer cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each computer cabinet holds 4 racks with 8 blades each and a 16-port switch, for a total of 3,584 blades. A single blade contains 2 compute nodes, each containing 2 Xeon processors and 1 Nvidia GPU. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
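&lt;br /&gt;
The counts quoted for the Tianhe-1A can be cross-checked with a little arithmetic. The Python sketch below uses only the figures given in this section; the variable names are illustrative.&lt;br /&gt;

```python
# Cross-check of the Tianhe-1A figures quoted in this section.
# Variable names are illustrative; the numbers come from the text.
compute_cabinets = 112
racks_per_cabinet = 4
blades_per_rack = 8
nodes_per_blade = 2
xeons_per_node = 2
gpus_per_node = 1

blades = compute_cabinets * racks_per_cabinet * blades_per_rack
nodes = blades * nodes_per_blade
xeons = nodes * xeons_per_node
gpus = nodes * gpus_per_node

print(blades, nodes, xeons, gpus)  # 3584 7168 14336 7168
```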
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&lt;br /&gt;
&lt;br /&gt;
[[Maybe you could make a table of characteristics of these supercomputers ... you could use top500 as a starting point, and add more detailed info on architecture ... though that might be hard to obtain for some.]]&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system: the cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, and the shared memory is provided at the software level by Kerrighed, an open-source extension to Linux. Kerrighed provides sequential consistency, process migration between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kerrighed&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;mordred2&amp;quot;&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62185</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62185"/>
		<updated>2012-04-16T02:26:31Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* mordred2 (Kerlabs) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and mordred2 from Kerlabs. [[How 'bout IBM's large systems--Blue Gene, etc]]&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes; each compute node contains one processor and 16 GB of memory. &amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;, which has a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus. &amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, is capable of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia GP-GPUs. In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 computer cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each computer cabinet holds 4 racks with 8 blades each and a 16-port switch, for a total of 3,584 blades. A single blade contains 2 compute nodes, each containing 2 Xeon processors and 1 Nvidia GPU. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&lt;br /&gt;
&lt;br /&gt;
[[Maybe you could make a table of characteristics of these supercomputers ... you could use top500 as a starting point, and add more detailed info on architecture ... though that might be hard to obtain for some.]]&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system: the cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, and the shared memory is provided at the software level by Kerrighed, an open-source extension to Linux. Kerrighed provides sequential consistency, process migration between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62184</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=62184"/>
		<updated>2012-04-16T02:26:11Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and mordred2 from Kerlabs. [[How 'bout IBM's large systems--Blue Gene, etc]]&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes; each compute node contains one processor and 16 GB of memory. &amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connections. Fujitsu uses its own proprietary network, known as the &amp;quot;Tofu Interconnect&amp;quot;, which has a six-dimensional [http://en.wikipedia.org/wiki/Mesh_topology mesh]/[http://en.wikipedia.org/wiki/Torus_interconnect torus] topology. Each set of 12 nodes is called a &amp;quot;node group&amp;quot; and is the unit of job allocation. Each node group is connected to adjacent node groups via a three-dimensional torus network. Additionally, the nodes within each node group are connected to their neighbors via their own three-dimensional mesh/torus. &amp;lt;ref name=&amp;quot;kpdf&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;ktofu&amp;quot;/&amp;gt;&amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, is capable of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia GP-GPUs. In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 computer cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each computer cabinet holds 4 racks with 8 blades each and a 16-port switch, for a total of 3,584 blades. A single blade contains 2 compute nodes, each containing 2 Xeon processors and 1 Nvidia GPU. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&lt;br /&gt;
&lt;br /&gt;
[[Maybe you could make a table of characteristics of these supercomputers ... you could use top500 as a starting point, and add more detailed info on architecture ... though that might be hard to obtain for some.]]&lt;br /&gt;
&lt;br /&gt;
==mordred2 (Kerlabs)==&lt;br /&gt;
mordred2 is one of several clusters operated by Kerlabs. It is a distributed shared memory system: the cluster contains 110 nodes, each with 2 dual-core AMD Opteron processors and 4 GB of memory, and the shared memory is provided at the software level by Kerrighed, an open-source extension to Linux. Kerrighed provides sequential consistency, process migration between nodes, and checkpointing (the ability to return to a previous application state in case of failure).&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kpdf&amp;quot;&amp;gt;http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;ktofu&amp;quot;&amp;gt;http://www.fujitsu.com/global/about/tech/k/whatis/network/&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Kerrighed&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref&amp;gt;http://kerrighed.org/php/clusterview.php?id=29&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=61812</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=61812"/>
		<updated>2012-04-11T18:05:47Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Tianhe-1A */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and [another example or two] [[How 'bout IBM's large systems--Blue Gene, etc]].&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes; each compute node contains one processor and 16 GB of memory. &amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connection. &amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, is capable of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia GP-GPUs. In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 computer cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each computer cabinet holds 4 racks with 8 blades each and a 16-port switch, for a total of 3,584 blades. A single blade contains 2 compute nodes, each containing 2 Xeon processors and 1 Nvidia GPU. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration.&lt;br /&gt;
&lt;br /&gt;
The system uses message passing rather than shared memory, so neither a system-wide cache coherency protocol nor a memory consistency protocol is necessary.&lt;br /&gt;
&lt;br /&gt;
[[Maybe you could make a table of characteristics of these supercomputers ... you could use top500 as a starting point, and add more detailed info on architecture ... though that might be hard to obtain for some.]]&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=61796</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=61796"/>
		<updated>2012-04-11T17:17:14Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Tianhe-1A */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and [another example or two] [[How 'bout IBM's large systems--Blue Gene, etc]].&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes; each compute node contains one processor and 16 GB of memory. &amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connection. &amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, is capable of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia GP-GPUs. In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 computer cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each computer cabinet holds 32 blades and a 16-port switch, for a total of 3,584 blades. A single blade contains 2 compute nodes, each containing 2 Xeon processors and 1 Nvidia GPU. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
The Arch interconnect uses point-to-point connections in a hybrid fat tree configuration. &lt;br /&gt;
&lt;br /&gt;
[[Maybe you could make a table of characteristics of these supercomputers ... you could use top500 as a starting point, and add more detailed info on architecture ... though that might be hard to obtain for some.]]&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=61788</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=61788"/>
		<updated>2012-04-11T17:08:07Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Tianhe-1A */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and [another example or two] [[How 'bout IBM's large systems--Blue Gene, etc]].&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes; each compute node contains one processor and 16 GB of memory. &amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connection. &amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;  [[What topology?  Surely not 95^2 links!]]&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, is capable of 4.701 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia GP-GPUs. In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 computer cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each computer cabinet holds 32 blades and a 16-port switch, for a total of 3,584 blades. A single blade contains 2 compute nodes, each containing 2 Xeon processors and 1 Nvidia GPU. The nodes are connected by a high-speed interconnect called Arch, which has a bandwidth of 160 Gbps.&lt;br /&gt;
&lt;br /&gt;
[[Maybe you could make a table of characteristics of these supercomputers ... you could use top500 as a starting point, and add more detailed info on architecture ... though that might be hard to obtain for some.]]&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=61781</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=61781"/>
		<updated>2012-04-11T16:52:45Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Tianhe-1A */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and [another example or two].&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes; each compute node contains one processor and 16 GB of memory. &amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connection. &amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
The Tianhe-1A, sponsored by the National University of Defense Technology in China, is capable of 2.566 petaFLOPS. It comprises 14,336 Xeon X5670 processors and 7,168 Nvidia GP-GPUs. In addition to the Xeon and Nvidia chips, there are 2,048 FeiTeng 1000 processors.&lt;br /&gt;
&lt;br /&gt;
All of these processors are contained in 112 computer cabinets, 12 storage cabinets, 6 communication cabinets, and 8 I/O cabinets. Each cabinet holds 32 blades and a 16-port switch.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=61780</id>
		<title>CSC 456 Spring 2012/11a NC</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/11a_NC&amp;diff=61780"/>
		<updated>2012-04-11T16:42:34Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Large-Scale Multiprocessor Examples==&lt;br /&gt;
&lt;br /&gt;
Some examples of large-scale multiprocessor systems include Fujitsu's K Computer, the Tianhe-1A from the National Supercomputer Center in Tianjin, China, and [another example or two].&lt;br /&gt;
&lt;br /&gt;
==K Computer==&lt;br /&gt;
&lt;br /&gt;
Made by [http://www.fujitsu.com/global/ Fujitsu], the K Computer consists of 88,128 processors spread across 864 cabinets. Each cabinet contains 96 compute nodes and 6 I/O nodes; each compute node contains one processor and 16 GB of memory. &amp;lt;ref name=&amp;quot;kprocs&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The system is networked together via [http://en.wikipedia.org/wiki/Point-to-point_(network_topology)#Point-to-point point-to-point], or direct, connection. &amp;lt;ref name=&amp;quot;knetwork&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The K Computer is not a [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory] (DSM) machine in which the physically separate nodes are addressed as one logically shared address space. Instead, the K Computer utilizes a [http://en.wikipedia.org/wiki/Message_Passing_Interface message passing interface] (MPI), allowing the nodes to pass messages to one another as needed.&lt;br /&gt;
&lt;br /&gt;
==Tianhe-1A==&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;kprocs&amp;quot;&amp;gt;http://en.wikipedia.org/wiki/K_computer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;ref name=&amp;quot;knetwork&amp;quot;&amp;gt;http://www.riken.jp/engn/r-world/info/release/pamphlet/aics/pdf/2010_09.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60031</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60031"/>
		<updated>2012-03-19T17:56:40Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Coherence and Shared Virtual Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is continuously changing. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely poses exciting and challenging problems. The three main areas being studied today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a single-processor (single-core) system, maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the coherence protocol needs to ensure that the data is the same in all caches. If it cannot ensure this, it needs to flag the affected cache line to indicate that it is out of date.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. If P2 then reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this: hardware support is needed to propagate the change so that all caches have a coherent view of the data. This is known as the write propagation requirement.&lt;br /&gt;
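&lt;br /&gt;
The stale-read scenario can be reproduced with a small Python sketch. It models two private caches in front of one shared memory with no coherence protocol at all; the dictionaries, the helper names, and the values 10 and 99 are illustrative assumptions. P1's write reaches main memory (write-through), yet P2 still reads its old cached copy.&lt;br /&gt;

```python
# Two private caches in front of one shared memory, with no coherence protocol.
# All names and values here are illustrative, not taken from a real system.
memory = {"M1": 10}
cache_p1 = {}
cache_p2 = {}

def read(cache, addr):
    # On a miss, fill the private cache from memory; on a hit, use the cached copy.
    if addr not in cache:
        cache[addr] = memory[addr]
    return cache[addr]

def write(cache, addr, value):
    # Write-through to memory, but nothing tells the other cache about the change.
    cache[addr] = value
    memory[addr] = value

read(cache_p1, "M1")          # P1 caches M1
read(cache_p2, "M1")          # P2 caches M1
write(cache_p1, "M1", 99)     # P1 changes M1
stale = read(cache_p2, "M1")  # P2 hits in its own cache
print(memory["M1"], stale)    # 99 10
```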
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a change to a cached value is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores later attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, because the update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; this is why the invalidation method is usually preferred.&lt;br /&gt;
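&lt;br /&gt;
The traffic difference can be put in numbers with a small sketch. The function name bus_messages and the scenario (one core writing the same shared variable repeatedly, with no reads by other cores in between) are assumptions made for this illustration.&lt;br /&gt;

```python
def bus_messages(writes, policy):
    """Count bus messages for `writes` back-to-back writes by one core
    to a variable that other caches hold, with no reads in between."""
    if policy == "update":
        return writes             # every write broadcasts the new value
    if policy == "invalidate":
        return min(writes, 1)     # only the first write invalidates
    raise ValueError(policy)


update_cost = bus_messages(100, "update")
invalidate_cost = bus_messages(100, "invalidate")
print(update_cost, invalidate_cost)   # prints: 100 1
```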
&lt;br /&gt;
Over the years, several protocols have been proposed to improve cache coherence performance.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback of this protocol appears when a single processor reads a block and then writes to it while no other processor shares that block. The read uses a bus transaction that places the block in the shared state. The write then requires another bus transaction to invalidate any shared copies. This second transaction is useless because no other processor is sharing the block, but the MSI protocol has no way to express this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same in this protocol as in the MSI protocol. The protocol introduces a new state: the exclusive state. The exclusive state means that the variable is in only this cache and that its value matches the value in main memory. A processor can therefore write to an exclusive line without a bus transaction, which avoids the useless invalidation described for MSI. The shared state now indicates that the variable may be held in more than one cache.&lt;br /&gt;
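&lt;br /&gt;
The saving can be illustrated with a simplified count of bus transactions for a read followed by a write. The function read_then_write and its parameters are hypothetical, and the model ignores everything except the state of one cache line.&lt;br /&gt;

```python
def read_then_write(protocol, other_sharers):
    """Return (bus_transactions, final_state) for one cache that reads a
    block it does not hold and then writes to it."""
    transactions = 1                   # read miss: fetch the block
    if protocol == "MESI" and not other_sharers:
        state = "E"                    # sole clean copy
    else:
        state = "S"
    if state == "S":
        transactions += 1              # upgrade: invalidate other copies
    state = "M"                        # going from E to M is silent
    return transactions, state


msi = read_then_write("MSI", other_sharers=False)
mesi = read_then_write("MESI", other_sharers=False)
print(msi, mesi)   # prints: (2, 'M') (1, 'M')
```

When other caches do share the block, both protocols need the second transaction in this model.&lt;br /&gt;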
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &quot;owns&quot; the variable and will provide the current value to other caches when requested (or at least it will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and can receive it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol combines the MESI and MOSI protocols, so a cache line can be in any of five states: Modified, Owned, Exclusive, Shared, or Invalid.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single-processor system, code executes correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. In a shared-memory system with multiple processors, however, two threads can access the same shared data element, such as a synchronization variable, and the output of the threads can change depending on which thread accesses it first. When this happens, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes at a cost in performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, a read of a particular location is expected to return the value of the last write to that location in program order. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations as long as these dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations.&lt;sup&gt;&lt;span id=&quot;4body&quot;&gt;[[#4foot|[4]]]&lt;/span&gt;&lt;/sup&gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
A programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
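&lt;br /&gt;
These two expectations together describe sequential consistency. As a hedged illustration, the sketch below enumerates every interleaving of a small two-thread test that respects program order and treats each access as atomic. The thread programs T0 and T1, the variables x and y, and the registers r0 and r1 are assumptions made for the example.&lt;br /&gt;

```python
from itertools import permutations

# Each thread is a list of operations executed in program order.
T0 = [("store", "x"), ("load", "y", "r0")]
T1 = [("store", "y"), ("load", "x", "r1")]


def outcomes():
    results = set()
    # A schedule names which thread performs each of the four steps.
    for schedule in set(permutations([0, 0, 1, 1])):
        mem = {"x": 0, "y": 0}
        regs = {}
        next_op = [0, 0]
        for tid in schedule:
            op = (T0, T1)[tid][next_op[tid]]
            next_op[tid] += 1
            if op[0] == "store":
                mem[op[1]] = 1             # atomic store, visible at once
            else:
                regs[op[2]] = mem[op[1]]   # atomic load
        results.add((regs["r0"], regs["r1"]))
    return results


sc_results = outcomes()
print(sorted(sc_results))   # prints: [(0, 1), (1, 0), (1, 1)]
```

The result (0, 0) never appears under these expectations. Hardware that lets a load bypass an earlier buffered store can produce it, which is one reason the programmer has to know the consistency model of the machine.&lt;br /&gt;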
&lt;br /&gt;
A strong consistency model that attempts uniprocessor-like consistency could create a global bottleneck and cost performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The following table lists various consistency models. The programmer must take the system's memory consistency model into account to create correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&quot;col&quot; class=&quot;unsortable&quot; | Type of Consistency Model&lt;br /&gt;
! scope=&quot;col&quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||       3             &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                3                                        &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            1                                        &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                            3                                      &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency]||                         2                                    &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                      1                                    &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    1                &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                         &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                  3                                   &lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                2                                      &lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                               3                                        &lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                   3                                          &lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                   3                                    &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
[[Image:Centralized.png|thumbnail|right|350px|'''Figure 2''' Example of a centralized system. Here, one processor monitors all pages in the cache.]]&lt;br /&gt;
[[Image:Distributed.png|thumbnail|right|350px|'''Figure 3''' Example of a distributed system. Here, each processor monitors a different subset of pages in the cache.]]&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in a multicache system. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol in the multicache hardware, so that the delay caused by conflicting writes to a memory location is small.&lt;br /&gt;
&lt;br /&gt;
In contrast, a shared virtual memory on a loosely coupled multiprocessor has no physically shared memory and a nontrivial communication cost between processors, so conflicts are not likely to be resolved with negligible delay; they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified by how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers, and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are the centralized manager (figure 2) and the distributed manager (figure 3). In the centralized system, one processor is designated the &quot;monitor&quot;. This processor keeps a list of information for each page in the cache. The list includes the owner of the page (the processor that accessed it last) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to the processors that have copies of the page. This differs from a bus-based system, in which the invalidation message is broadcast to all processors. A drawback of the centralized manager is the bottleneck at the monitor processor. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
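&lt;br /&gt;
The bookkeeping done by the monitor can be sketched as follows. The class CentralManager, its method names, and the page label p7 are hypothetical, and the sketch assumes that ownership changes only on a write.&lt;br /&gt;

```python
class CentralManager:
    """Toy page directory kept by the single monitor processor."""

    def __init__(self):
        self.owner = {}      # maps page to the processor that wrote it last
        self.copyset = {}    # maps page to the processors holding a copy

    def read(self, proc, page):
        # A reader is added to the set of processors holding the page.
        self.copyset.setdefault(page, set()).add(proc)

    def write(self, proc, page):
        # Invalidate only the processors that actually hold a copy.
        targets = self.copyset.get(page, set()) - {proc}
        self.owner[page] = proc
        self.copyset[page] = {proc}
        return sorted(targets)


mgr = CentralManager()
mgr.read(1, "p7")
mgr.read(3, "p7")
invalidated = mgr.write(2, "p7")
print(invalidated)   # prints: [1, 3]
```

Processors that never read the page receive no message, unlike a broadcast on a bus.&lt;br /&gt;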
&lt;br /&gt;
The distributed manager is similar to a centralized manager, but instead of one processor monitoring all pages, a subset of the pages is given to each processor. So, processor 0 would only monitor pages 1 through i, and processor 1 would only monitor pages i+1 through n.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented in software, without requiring any special hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
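&lt;br /&gt;
As a hedged illustration, the sketch below models the two-process form of Peterson's algorithm as a small state machine and explores every interleaving of atomic steps. The step numbering (0 to 3) and the function names are assumptions made for the example, and the model assumes that each step is atomic and immediately visible to the other process.&lt;br /&gt;

```python
def step(state, i):
    """Advance process i by one atomic step of Peterson's algorithm."""
    pc, flag, turn = list(state[0]), list(state[1]), state[2]
    other = 1 - i
    if pc[i] == 0:                     # announce interest
        flag[i] = True
        pc[i] = 1
    elif pc[i] == 1:                   # give priority to the other side
        turn = other
        pc[i] = 2
    elif pc[i] == 2:                   # wait while the other has priority
        if not (flag[other] and turn == other):
            pc[i] = 3
    else:                              # pc 3: leave the critical section
        flag[i] = False
        pc[i] = 0
    return (tuple(pc), tuple(flag), turn)


def reachable_states():
    start = ((0, 0), (False, False), 0)
    seen, frontier = {start}, [start]
    while frontier:
        state = frontier.pop()
        for i in (0, 1):
            nxt = step(state, i)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


states = reachable_states()
both_inside = [s for s in states if s[0] == (3, 3)]
print(len(both_inside))   # prints: 0
```

No reachable state has both processes at step 3, so mutual exclusion holds in this model. On hardware with a relaxed memory model the same code would need memory fences to keep the stores and loads in order.&lt;br /&gt;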
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions such as [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment], and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, modern processors provide some form of atomic read-modify-write instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
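&lt;br /&gt;
The optimistic pattern built on compare-and-swap can be sketched as a retry loop. Python has no compare-and-swap instruction, so the AtomicInt class below is a hypothetical stand-in whose internal lock only emulates the atomicity that the hardware instruction would provide; the thread count and iteration count are arbitrary.&lt;br /&gt;

```python
import threading


class AtomicInt:
    """Stand-in for a hardware word; the lock only emulates the
    atomicity that a real compare-and-swap instruction provides."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value != expected:
                return False           # collision detected
            self._value = new
            return True


def increment(counter):
    while True:                        # optimistic retry loop
        old = counter.load()
        if counter.compare_and_swap(old, old + 1):
            return


def worker(counter, times):
    for _ in range(times):
        increment(counter)


counter = AtomicInt()
threads = [threading.Thread(target=worker, args=(counter, 1000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = counter.load()
print(total)   # prints: 4000
```

Each thread proceeds without holding a lock across the whole update and simply retries when another thread changed the value first, which is the collision detection described above.&lt;br /&gt;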
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt;http://dl.acm.org/citation.cfm?id=75105&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60028</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60028"/>
		<updated>2012-03-19T17:55:02Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Coherence and Shared Virtual Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Although the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers continues to change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely remains an exciting challenge. Three main areas of current work are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a single-processor (single-core) system, maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. A copy of the data can be present in any processor's cache, and the coherence protocol needs to ensure that every cached copy holds the same value. If it cannot keep all copies identical, it must flag the out-of-date cache lines so that they are not used.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared-memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. If P2 then reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 next operates on M1, it uses the stale value stored in its own cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches; the requirement that a write by one processor become visible to the others is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The write policy only controls how a changed value in a cache is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores later attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, because the update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; this is why the invalidation method is usually preferred.&lt;br /&gt;
&lt;br /&gt;
Over the years, several protocols have been proposed to improve cache coherence performance.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback of this protocol appears when a single processor reads a block and then writes to it while no other processor shares that block. The read uses a bus transaction that places the block in the shared state. The write then requires another bus transaction to invalidate any shared copies. This second transaction is useless because no other processor is sharing the block, but the MSI protocol has no way to express this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same in this protocol as in the MSI protocol. The protocol introduces a new state: the exclusive state. The exclusive state means that the variable is in only this cache and that its value matches the value in main memory. A processor can therefore write to an exclusive line without a bus transaction, which avoids the useless invalidation described for MSI. The shared state now indicates that the variable may be held in more than one cache.&lt;br /&gt;
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &quot;owns&quot; the variable and will provide the current value to other caches when requested (or at least it will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and can receive it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol combines the MESI and MOSI protocols, so a cache line can be in any of five states: Modified, Owned, Exclusive, Shared, or Invalid.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single-processor system, code executes correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. In a shared-memory system with multiple processors, however, two threads can access the same shared data element, such as a synchronization variable, and the output of the threads can change depending on which thread accesses it first. When this happens, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes at a cost in performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, a read of a particular location is expected to return the value of the last write to that location in program order. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations as long as these dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations.&lt;sup&gt;&lt;span id=&quot;4body&quot;&gt;[[#4foot|[4]]]&lt;/span&gt;&lt;/sup&gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
A programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
&lt;br /&gt;
A strong consistency model that attempts uniprocessor-like consistency could create a global bottleneck and cost performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The following table lists various consistency models. The programmer must take the system's memory consistency model into account to create correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&quot;col&quot; class=&quot;unsortable&quot; | Type of Consistency Model&lt;br /&gt;
! scope=&quot;col&quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||       3             &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                3                                        &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            1                                        &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                            3                                      &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency]||                         2                                    &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                      1                                    &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    1                &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                         &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                  3                                   &lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                2                                      &lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                               3                                        &lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                   3                                          &lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                   3                                    &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
[[Image:Centralized.png|thumbnail|right|350px|Example of a centralized system. Here, one processor monitors all pages in the cache.]]&lt;br /&gt;
[[Image:Distributed.png|thumbnail|right|350px|Example of a distributed system. Here, each processor monitors a different subset of pages in the cache.]]&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in a multicache system. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol in the multicache hardware, so that the delay caused by conflicting writes to a memory location is small.&lt;br /&gt;
&lt;br /&gt;
In contrast, a shared virtual memory on a loosely coupled multiprocessor has no physically shared memory and a nontrivial communication cost between processors, so conflicts are not likely to be resolved with negligible delay; they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified by how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers, and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are the centralized manager and the distributed manager. In the centralized system, one processor is designated the &quot;monitor&quot;. This processor keeps a list of information for each page in the cache. The list includes the owner of the page (the processor that accessed it last) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to the processors that have copies of the page. This differs from a bus-based system, in which the invalidation message is broadcast to all processors. A drawback of the centralized manager is the bottleneck at the monitor processor. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
&lt;br /&gt;
The distributed manager is similar to a centralized manager, but instead of one processor monitoring all pages, a subset of the pages is given to each processor. So, processor 0 would only monitor pages 1 through i, and processor 1 would only monitor pages i+1 through n.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented in software, without requiring any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions such as [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment], and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, modern processors rely on some form of read-modify-write atomic instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt;http://dl.acm.org/citation.cfm?id=75105&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60025</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60025"/>
		<updated>2012-03-19T17:53:50Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Coherence and Shared Virtual Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is undergoing continuous change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely poses exciting and challenging problems. The three main areas being studied today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all cached copies are the same, it needs to flag the stale cache line as not up to date.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. If P2 then reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches; the requirement that a write by one processor become visible to the others is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that cache write policy can provide cache coherence, but this is incorrect. Cache write policy only controls how a change in value of cache is propagated to lower level cache or main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, because an update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered, which is why invalidation is the preferred method.&lt;br /&gt;
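&lt;br /&gt;
The traffic difference described above can be sketched with a small counting model. This is only an illustration under stated assumptions: the function name bus_messages is hypothetical, and the model assumes that a single core writes the variable repeatedly while no other core reads it in between.&lt;br /&gt;

```python
def bus_messages(policy: str, writes: int) -> int:
    """Count bus messages for consecutive writes by one core to one variable.

    Assumption: no other core touches the variable between the writes,
    which is the best case for the invalidate method.
    """
    if policy == "update":
        return writes          # every write broadcasts the new value
    if policy == "invalidate":
        return min(writes, 1)  # only the first write sends an invalidation
    raise ValueError(f"unknown policy: {policy}")


print(bus_messages("update", 100))      # 100
print(bus_messages("invalidate", 100))  # 1
```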
&lt;br /&gt;
In order to improve cache coherence performance over the years, several protocols have been proposed.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback to this protocol occurs when a single processor wants to read blocks and then write to them without another processor sharing that block. After reading the block, a bus transaction places the block into a shared state. The write then occurs and another bus transaction is sent to invalidate the shared copy. This second transaction is useless as no other processors are sharing the block, but the MSI protocol has no way to specify this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same as in the MSI protocol. This protocol introduces a new state: the exclusive state. The exclusive state means that the variable is in only this cache and that its value matches the value in main memory. The shared state now indicates that the variable is contained in more than one cache.&lt;br /&gt;
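&lt;br /&gt;
The benefit of the exclusive state for the read-then-write pattern described under MSI can be sketched as follows. This is a simplified model, not a full protocol implementation: the function read_then_write and its other_sharers parameter are hypothetical names, and only one cache's view of a single block is tracked.&lt;br /&gt;

```python
def read_then_write(protocol: str, other_sharers: bool = False):
    """Return (final_state, bus_transactions) for one cache that reads a
    block it does not hold and then writes to it."""
    if protocol not in ("MSI", "MESI"):
        raise ValueError(f"unknown protocol: {protocol}")

    transactions = 1  # the read miss always appears on the bus
    if protocol == "MESI" and not other_sharers:
        state = "E"   # no other cache holds the block
    else:
        state = "S"   # MSI cannot record that the copy is the only one

    if state == "S":
        transactions += 1  # the write must announce an invalidation
    state = "M"            # E to M needs no bus transaction
    return state, transactions


print(read_then_write("MSI"))   # ('M', 2)
print(read_then_write("MESI"))  # ('M', 1)
```

When another cache does share the block, this model gives MESI the same two transactions as MSI, since the invalidation is then genuinely needed.&lt;br /&gt;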
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and can receive it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single-processor system, code executes correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. In a shared memory model with multiple processors, however, two threads can access a shared data element, such as a synchronization variable, and the output of the threads changes depending on which thread accesses the shared data element first. If this occurs, the program output may not be the expected value. Maintaining program order is very important for memory consistency, but it comes at a cost in performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency requires only that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, a read of a particular location can be expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
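&lt;br /&gt;
These two expectations can be made concrete with a small enumeration. The sketch below assumes a hypothetical two-processor program (P0 stores to x and then loads y; P1 stores to y and then loads x, with both variables initially 0) and lists every result that is possible when accesses are atomic and each processor's program order is kept. The names P0, P1, and sc_outcomes are illustrative only.&lt;br /&gt;

```python
from itertools import permutations

# Each operation is (processor, kind, variable); x and y start at 0.
P0 = [("P0", "store", "x"), ("P0", "load", "y")]
P1 = [("P1", "store", "y"), ("P1", "load", "x")]


def sc_outcomes():
    """Return the set of (P0 load result, P1 load result) pairs over all
    interleavings that respect each processor's program order."""
    ops = P0 + P1  # indices 0,1 belong to P0 and 2,3 belong to P1
    outcomes = set()
    for order in permutations(range(4)):
        # Skip interleavings that violate program order within a processor.
        if order.index(0) > order.index(1) or order.index(2) > order.index(3):
            continue
        memory = {"x": 0, "y": 0}
        loaded = {}
        for idx in order:
            proc, kind, var = ops[idx]
            if kind == "store":
                memory[var] = 1          # each access takes effect atomically
            else:
                loaded[proc] = memory[var]
        outcomes.add((loaded["P0"], loaded["P1"]))
    return outcomes


print(sorted(sc_outcomes()))  # [(0, 1), (1, 0), (1, 1)]
```

The result (0, 0) never appears under these expectations; a system that reorders a store after a later load to a different location can produce it, which is why the consistency model matters to the programmer.&lt;br /&gt;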
&lt;br /&gt;
A strong consistency model that attempts uniprocessor-like consistency could cause a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The following table lists various consistency models. It is the programmer who must take the memory consistency model into account to create correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; class=&amp;quot;unsortable&amp;quot; | Type of Consistency Model&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||       3             &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                3                                        &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            1                                        &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                            3                                      &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency ]||                         2                                    &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                      1                                    &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    1                &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                         &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                  3                                   &lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                2                                      &lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                               3                                        &lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                   3                                          &lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                   3                                    &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
[[Image:Centralized.png|thumbnail|right|400px|Example of a centralized system]]&lt;br /&gt;
[[Image:Distributed.png|thumbnail|right|400px|Example of a distributed system]]&lt;br /&gt;
Memory coherence in a shared virtual memory system differs from cache coherence in a multicache system. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it practical to use a sophisticated hardware coherence protocol that keeps the delay of conflicting writes to a memory location small.&lt;br /&gt;
&lt;br /&gt;
In contrast, a shared virtual memory on a loosely coupled multiprocessor has no physically shared memory and a nontrivial communication cost between processors, so conflicts are unlikely to be resolved with negligible delay; they more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers, and the data structures used. The ''page table'' is therefore an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are the centralized manager and the distributed manager. In the centralized system, one processor is designated the &amp;quot;monitor&amp;quot;. This processor keeps a list of information for each page. The list includes the owner of the page (the processor that most recently had write access to it) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to processors that have copies of the page. This differs from a bus-based system, where the invalidation message is broadcast to all processors. A drawback of the centralized manager is that the monitor processor becomes a bottleneck. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
&lt;br /&gt;
The distributed manager is similar to a centralized manager, but instead of one processor monitoring all pages, a subset of the pages is given to each processor. So, processor 0 would only monitor pages 1 through i, and processor 1 would only monitor pages i+1 through n.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
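&lt;br /&gt;
The bookkeeping of both manager schemes can be sketched as below. This is a minimal model under stated assumptions, not the algorithm from the cited paper: the class Manager, its read and write methods, and the modulo mapping in manager_for are all hypothetical, and the network messages are reduced to a returned set of processors to invalidate.&lt;br /&gt;

```python
class Manager:
    """Tracks the owner and the copy set of each page it is responsible for."""

    def __init__(self):
        self.owner = {}     # page -> owning processor
        self.copy_set = {}  # page -> set of processors holding a copy

    def read(self, page, proc):
        """Record that proc now holds a read copy of page."""
        self.owner.setdefault(page, proc)
        self.copy_set.setdefault(page, set()).add(proc)

    def write(self, page, proc):
        """Make proc the owner and return the processors to invalidate."""
        targets = self.copy_set.get(page, set()) - {proc}
        self.owner[page] = proc
        self.copy_set[page] = {proc}
        return targets


def manager_for(page, managers):
    """Distributed case: pick the manager for a page.

    Assumption: a simple modulo mapping; any fixed partition of the pages
    among processors would serve the same purpose.
    """
    return managers[page % len(managers)]


managers = [Manager(), Manager()]  # a single Manager models the centralized case
m = manager_for(7, managers)
for proc in (0, 2, 3):
    m.read(7, proc)
print(sorted(m.write(7, 2)))  # [0, 3]
```

Only processors 0 and 3 receive an invalidation, in contrast to a bus broadcast, and requests for other pages can be handled by the other manager.&lt;br /&gt;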
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented in software, without requiring any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
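&lt;br /&gt;
A minimal sketch of Peterson's algorithm for two threads follows. It assumes that the loads and stores below take effect in program order, which holds for the standard CPython interpreter; on hardware with a relaxed memory model the algorithm additionally needs memory fences. The function name run_peterson and the iteration count are illustrative choices.&lt;br /&gt;

```python
import threading
import time


def run_peterson(iterations: int = 200) -> int:
    """Two threads increment a shared counter under Peterson's algorithm."""
    flag = [False, False]  # flag[i] is True while thread i wants to enter
    turn = [0]             # the thread that must wait when both want to enter
    counter = [0]          # shared data protected by the critical section

    def worker(me: int) -> None:
        other = 1 - me
        for _ in range(iterations):
            flag[me] = True        # announce intent to enter
            turn[0] = other        # give priority to the other thread
            while flag[other] and turn[0] == other:
                time.sleep(0)      # yield so the other thread can progress
            counter[0] += 1        # critical section
            flag[me] = False       # exit section

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


print(run_peterson())  # 400
```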
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions such as [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment], and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, modern processors rely on some form of read-modify-write atomic instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
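&lt;br /&gt;
The optimistic approach built on compare-and-swap can be sketched as a retry loop. Python exposes no hardware compare-and-swap, so the class AtomicInt below is a hypothetical software stand-in that uses a lock only to emulate the atomicity the instruction would provide; the names increment and run_cas are likewise illustrative.&lt;br /&gt;

```python
import threading


class AtomicInt:
    """Software stand-in for a memory word with atomic compare-and-swap."""

    def __init__(self, value: int = 0):
        self._value = value
        self._guard = threading.Lock()  # emulates the hardware atomicity

    def load(self) -> int:
        return self._value

    def compare_and_swap(self, expected: int, new: int) -> bool:
        """Store new only if the word still holds expected."""
        with self._guard:
            if self._value != expected:
                return False  # collision detected: another thread got in first
            self._value = new
            return True


def increment(cell: AtomicInt) -> None:
    """Optimistic update: read, compute, and retry on collision."""
    while True:
        seen = cell.load()
        if cell.compare_and_swap(seen, seen + 1):
            return


def run_cas(threads: int = 4, per_thread: int = 1000) -> int:
    cell = AtomicInt()

    def work() -> None:
        for _ in range(per_thread):
            increment(cell)

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.load()


print(run_cas())  # 4000
```

No thread ever waits while holding a lock on the data itself; a failed compare-and-swap is the collision detection, and the update is simply retried.&lt;br /&gt;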
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt;http://dl.acm.org/citation.cfm?id=75105&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60024</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60024"/>
		<updated>2012-03-19T17:53:09Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Coherence and Shared Virtual Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is undergoing continuous change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely poses exciting and challenging problems. The three main areas being studied today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all cached copies are the same, it needs to flag the stale cache line as not up to date.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. If P2 then reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches; the requirement that a write by one processor become visible to the others is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that cache write policy can provide cache coherence, but this is incorrect. Cache write policy only controls how a change in value of cache is propagated to lower level cache or main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, because an update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered, which is why invalidation is the preferred method.&lt;br /&gt;
&lt;br /&gt;
In order to improve cache coherence performance over the years, several protocols have been proposed.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback to this protocol occurs when a single processor wants to read blocks and then write to them without another processor sharing that block. After reading the block, a bus transaction places the block into a shared state. The write then occurs and another bus transaction is sent to invalidate the shared copy. This second transaction is useless as no other processors are sharing the block, but the MSI protocol has no way to specify this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same as in the MSI protocol. This protocol introduces a new state: the exclusive state. The exclusive state means that the variable is in only this cache and that its value matches the value in main memory. The shared state now indicates that the variable is contained in more than one cache.&lt;br /&gt;
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and can receive it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single-processor system, code executes correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. In a shared memory model with multiple processors, however, two threads can access a shared data element, such as a synchronization variable, and the output of the threads changes depending on which thread accesses the shared data element first. If this occurs, the program output may not be the expected value. Maintaining program order is very important for memory consistency, but it comes at a cost in performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency requires only that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, a read of a particular location can be expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences.  The compiler and hardware can freely reorder operations to different locations as long as these dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
&lt;br /&gt;
A strong consistency model that attempts uniprocessor-like consistency can create a global bottleneck and cost performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs.&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs.&lt;br /&gt;
&lt;br /&gt;
The following table lists various consistency models. It is the programmer who must take the memory consistency model into account to create correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; class=&amp;quot;unsortable&amp;quot; | Type of Consistency Model&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||       3             &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                3                                        &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            1                                        &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                            3                                      &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency ]||                         2                                    &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                      1                                    &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    1                &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                         &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                  3                                   &lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                2                                      &lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                               3                                        &lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                   3                                          &lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                   3                                    &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
[[Image:Centralized.png|thumbnail|right|500px|Example of a centralized system]]&lt;br /&gt;
[[Image:Centralized.png|thumbnail|right|500px|Example of a distributed system]]&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in a multicache system. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol in the multicache hardware, so that the time delay of conflicting writes to a memory location is small.&lt;br /&gt;
&lt;br /&gt;
In contrast, a shared virtual memory on a loosely coupled multiprocessor has no physically shared memory, and communication between processors has a nontrivial cost. Conflicts are therefore unlikely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified by how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers, and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are the centralized manager and the distributed manager. In the centralized system, one processor is designated the &amp;quot;monitor&amp;quot;. This processor maintains a table with an entry for each page. Each entry records the owner of the page (the most recent processor to have write access to it) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to the processors that have copies of the page. This differs from a bus-based system, in which the invalidation message is broadcast to all processors. A drawback of the centralized manager is that the monitor processor becomes a bottleneck. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
&lt;br /&gt;
The distributed manager is similar to the centralized manager, but instead of one processor monitoring all pages, each processor is given a subset of the pages to monitor. For example, processor 0 would monitor only pages 1 through i, and processor 1 would monitor only pages i+1 through n.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented in software, without requiring any special hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update, with the update retried if a collision is found. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, modern processors generally provide some form of atomic read-modify-write instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://dl.acm.org/citation.cfm?id=75105&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60023</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60023"/>
		<updated>2012-03-19T17:52:55Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Coherence and Shared Virtual Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is undergoing continuous change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that more closely meet programmers' needs and expectations poses exciting and challenging problems. The three main areas being considered by scientists today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all the copies are the same, then it needs to flag a cache line to indicate that it is not up to date.&lt;br /&gt;
&lt;br /&gt;
The figure shown here depicts a 4-processor shared-memory system in which each processor has its own cache.  Suppose processor P1 reads memory location M1 and stores it in its local cache. If P2 then reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a change to a value in a cache is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as the update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; this is why invalidation is the preferred method.&lt;br /&gt;
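&lt;br /&gt;
The traffic difference can be illustrated with a minimal sketch. It assumes a toy model in which one core writes the same variable repeatedly while the other caches start with a valid copy and never re-read it; the function name bus_messages and the policy strings are hypothetical and chosen only for this illustration.&lt;br /&gt;

```python
# Toy model (an assumption, not a real bus): one writer core, other caches hold a copy.
def bus_messages(policy, writes):
    """Count bus messages when one core writes the same variable repeatedly."""
    others_have_copy = True   # remote caches start with a valid copy
    messages = 0
    for _ in range(writes):
        if policy == "update":
            messages += 1             # every write broadcasts the new value
        elif policy == "invalidate":
            if others_have_copy:
                messages += 1         # only the first write invalidates remote copies
                others_have_copy = False
        else:
            raise ValueError("unknown policy: " + policy)
    return messages
```

Under these assumptions, ten writes cost ten bus messages with the update method and one with the invalidate method. If another core re-reads the variable between writes, the invalidate method pays for a miss and a further invalidation, so the gap narrows.&lt;br /&gt;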
&lt;br /&gt;
Over the years, several protocols have been proposed to improve cache coherence performance.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback to this protocol occurs when a single processor wants to read blocks and then write to them without another processor sharing that block. After reading the block, a bus transaction places the block into a shared state. The write then occurs and another bus transaction is sent to invalidate the shared copy. This second transaction is useless as no other processors are sharing the block, but the MSI protocol has no way to specify this.&lt;br /&gt;
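&lt;br /&gt;
The drawback can be traced with a small state table. This is a sketch of only the processor-side transitions for a single block; the bus transaction names BusRd, BusRdX, and BusUpgr are conventional textbook labels assumed here, not names taken from this article, and transitions triggered by snooping other caches are left out.&lt;br /&gt;

```python
# Processor-side MSI transitions for one cache block (snooped transitions omitted).
# Maps (state, operation) to (next_state, bus_transaction or None).
# The transaction names are assumed labels, not part of the article.
MSI = {
    ("I", "read"):  ("S", "BusRd"),
    ("I", "write"): ("M", "BusRdX"),
    ("S", "read"):  ("S", None),
    ("S", "write"): ("M", "BusUpgr"),  # invalidates other copies, even if none exist
    ("M", "read"):  ("M", None),
    ("M", "write"): ("M", None),
}

def run(ops, state="I"):
    """Apply a sequence of operations; return the final state and bus transactions."""
    transactions = []
    for op in ops:
        state, bus = MSI[(state, op)]
        if bus is not None:
            transactions.append(bus)
    return state, transactions
```

Calling run([&amp;quot;read&amp;quot;, &amp;quot;write&amp;quot;]) ends in state M after two bus transactions, the second of which is the redundant invalidation described above when no other cache shares the block.&lt;br /&gt;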
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same in this protocol as they are in the MSI protocol. This protocol introduces a new state: the exclusive state. The exclusive state means that the variable is in only this cache and its value matches the value in main memory. The shared state therefore now indicates that the variable is contained in more than one cache. Because a block in the exclusive state has no other copies, it can be written without a bus transaction to invalidate other caches, which avoids the redundant transaction described for MSI.&lt;br /&gt;
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least will decide whether to provide it when asked). This is useful because the requesting cache does not have to read the value from main memory and instead receives it from the owning cache, which is much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
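The MOESI protocol combines the MESI and MOSI protocols: it has both the exclusive state of MESI and the owned state of MOSI, giving five states in total (Modified, Owned, Exclusive, Shared, and Invalid).&lt;br /&gt;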
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations.  In a single-processor system, code will execute correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables.  In a shared-memory model with multiple processors, however, two threads can access the same shared data element, such as a synchronization variable, and the output of the threads then depends on which thread accesses it first.  When the accesses do not occur in the order the programmer assumed, the program output may not be the expected value. Maintaining program order is therefore very important for memory consistency, but it comes at the cost of performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
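&lt;br /&gt;
To make concrete how a consistency model restricts the values a read can return, the following sketch enumerates every interleaving of a two-thread program that keeps each thread's own program order, which is what sequential consistency allows. The program (thread 0 stores x then loads y, thread 1 stores y then loads x) and all names are assumptions made for this illustration.&lt;br /&gt;

```python
from itertools import permutations

# Each operation is (thread, kind, variable); x and y start at 0.
# Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x
PROGRAM = [
    (0, "store", "x"), (0, "load", "y"),
    (1, "store", "y"), (1, "load", "x"),
]

def sc_outcomes():
    """Return every (r0, r1) pair reachable when program order is preserved."""
    outcomes = set()
    for order in permutations(range(4)):
        # Keep only interleavings that respect each thread's program order.
        if [i for i in order if i in (0, 1)] != [0, 1]:
            continue
        if [i for i in order if i in (2, 3)] != [2, 3]:
            continue
        memory = {"x": 0, "y": 0}
        regs = {}
        for i in order:
            thread, kind, var = PROGRAM[i]
            if kind == "store":
                memory[var] = 1
            else:
                regs[thread] = memory[var]
        outcomes.add((regs[0], regs[1]))
    return outcomes
```

The result never contains (0, 0): at least one load must observe the other thread's store when program order is kept. Weaker models that let a load pass an earlier store to a different location can produce (0, 0), which is why the programmer has to know which model the system provides.&lt;br /&gt;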
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, a read of a particular location can be expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences.  The compiler and hardware can freely reorder operations to different locations as long as these dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
&lt;br /&gt;
A strong consistency model that attempts uniprocessor-like consistency can create a global bottleneck and cost performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs.&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs.&lt;br /&gt;
&lt;br /&gt;
The following table lists various consistency models. It is the programmer who must take the memory consistency model into account to create correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; class=&amp;quot;unsortable&amp;quot; | Type of Consistency Model&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||       3             &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                3                                        &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            1                                        &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                            3                                      &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency ]||                         2                                    &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                      1                                    &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    1                &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                         &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                  3                                   &lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                2                                      &lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                               3                                        &lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                   3                                          &lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                   3                                    &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
[[Image:Centralized.png|thumbnail|right|600px|Example of a centralized system]]&lt;br /&gt;
[[Image:Centralized.png|thumbnail|right|600px|Example of a distributed system]]&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in a multicache system. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol in the multicache hardware, so that the time delay of conflicting writes to a memory location is small.&lt;br /&gt;
&lt;br /&gt;
In contrast, a shared virtual memory on a loosely coupled multiprocessor has no physically shared memory, and communication between processors has a nontrivial cost. Conflicts are therefore unlikely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified by how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers, and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are the centralized manager and the distributed manager. In the centralized system, one processor is designated the &amp;quot;monitor&amp;quot;. This processor maintains a table with an entry for each page. Each entry records the owner of the page (the most recent processor to have write access to it) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to the processors that have copies of the page. This differs from a bus-based system, in which the invalidation message is broadcast to all processors. A drawback of the centralized manager is that the monitor processor becomes a bottleneck. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
&lt;br /&gt;
The distributed manager is similar to the centralized manager, but instead of one processor monitoring all pages, each processor is given a subset of the pages to monitor. For example, processor 0 would monitor only pages 1 through i, and processor 1 would monitor only pages i+1 through n.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
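&lt;br /&gt;
The bookkeeping done by a manager can be sketched as follows. The class Manager, its method names, and the helper manager_for are hypothetical names for this illustration. The sketch also assumes a simple modulo mapping from page number to manager, rather than the contiguous page ranges used in the example above, because it needs no knowledge of the total page count.&lt;br /&gt;

```python
class Manager:
    """Tracks the owner and copy set of each page it manages (illustrative sketch)."""

    def __init__(self):
        self.owner = {}     # page number to owning processor
        self.copyset = {}   # page number to set of processors holding a copy

    def read(self, page, proc):
        """Record that proc now holds a read copy of the page."""
        self.owner.setdefault(page, proc)
        self.copyset.setdefault(page, set()).add(proc)

    def write(self, page, proc):
        """Make proc the owner; return only the processors that need an invalidation."""
        holders = self.copyset.get(page, set()) - {proc}
        self.owner[page] = proc
        self.copyset[page] = {proc}
        return sorted(holders)

def manager_for(page, managers):
    # Distributed manager: a fixed mapping from page to manager (modulo is an assumption).
    return managers[page % len(managers)]
```

With a single Manager instance this behaves like the centralized scheme: a write returns just the processors in the copy set, so no broadcast is needed. With a list of managers, requests for different pages go to different processors, which spreads the load that would otherwise fall on one monitor.&lt;br /&gt;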
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented in software, without requiring any special hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update, with the update retried if a collision is found. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, modern processors generally provide some form of atomic read-modify-write instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
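&lt;br /&gt;
The optimistic approach built on compare-and-swap can be sketched in software. Python has no compare-and-swap instruction, so the class AtomicInt below is a hypothetical stand-in that uses a lock only to emulate the atomicity the hardware would provide; the retry loop in increment is the part that corresponds to the optimistic update with collision detection.&lt;br /&gt;

```python
import threading

class AtomicInt:
    """Software stand-in for a memory word with compare-and-swap (an assumption)."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # emulates the atomicity of the hardware instruction

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Store new only if the current value equals expected; report success."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def increment(counter):
    # Optimistic update: read, compute, and retry if another thread interfered.
    while True:
        old = counter.load()
        if counter.compare_and_swap(old, old + 1):
            return

def run(threads=4, per_thread=1000):
    """Increment one shared counter from several threads and return the final value."""
    counter = AtomicInt()

    def worker():
        for _ in range(per_thread):
            increment(counter)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.load()
```

A failed compare_and_swap is the collision detection: it means another thread changed the value between the load and the swap, so the update is recomputed from the fresh value instead of blocking on a lock beforehand.&lt;br /&gt;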
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://dl.acm.org/citation.cfm?id=75105&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60022</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60022"/>
		<updated>2012-03-19T17:52:32Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Coherence and Shared Virtual Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is undergoing continuous change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that more closely meet programmers' needs and expectations poses exciting and challenging problems. The three main areas being considered by scientists today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that every cached copy of the same data holds the same value. If it cannot keep all the copies identical, it needs to flag the out-of-date cache lines so that they are not used.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. Then, if P2 reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The write policy only controls how a changed value is propagated from a cache to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
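&lt;br /&gt;
The P1/P2 scenario, and the point that a write policy alone does not help, can be sketched in a few lines of Python. This is only an illustrative model: the dictionaries standing in for memory and caches, and the values 5 and 9, are assumptions made for the example.&lt;br /&gt;

```python
# Main memory and two private caches, modelled as plain dictionaries.
memory = {"M1": 5}
cache_p1, cache_p2 = {}, {}

cache_p1["M1"] = memory["M1"]   # P1 reads M1 and caches it
cache_p2["M1"] = memory["M1"]   # P2 reads M1 and caches it

cache_p1["M1"] = 9              # P1 writes its cached copy
memory["M1"] = cache_p1["M1"]   # write-through policy: memory is updated as well

# Nothing told P2's cache about the write, so P2 still hits on its old copy.
print(memory["M1"])     # 9
print(cache_p2["M1"])   # 5 (stale)
```

Even with write-through, P2 keeps reading the old value until a coherence protocol updates or invalidates its copy.&lt;br /&gt;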
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory. &lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as the update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; further writes by the same core generate no bus traffic until another core loads the variable again. This is why invalidation is the preferred method.&lt;br /&gt;
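&lt;br /&gt;
The difference in bus traffic can be made concrete with a rough count. The sketch below is a simplified model, not a description of any real bus: the function name bus_messages and the trace format are hypothetical, read misses are ignored, and only coherence messages for a single variable are counted.&lt;br /&gt;

```python
def bus_messages(trace, policy):
    """Count coherence messages on the bus for one shared variable.

    trace  -- list of (core, op) pairs, where op is "r" (read) or "w" (write)
    policy -- "update" or "invalidate"
    """
    holders = set()   # cores whose cache currently holds a valid copy
    messages = 0
    for core, op in trace:
        if op == "r":
            holders.add(core)        # a read (re)loads a valid copy
        elif policy == "update":
            holders.add(core)
            messages += 1            # every write broadcasts the new value
        else:
            if holders - {core}:     # some other cache still holds a copy
                messages += 1        # one invalidation removes those copies
            holders = {core}         # the writer is now the only holder
    return messages


# Core 0 writes ten times in a row while core 1 holds a copy from an earlier read.
trace = [(0, "r"), (1, "r")] + [(0, "w")] * 10
print(bus_messages(trace, "update"))      # 10
print(bus_messages(trace, "invalidate"))  # 1
```

In this trace the update method sends one message per write, while the invalidate method sends a single message for the first write only.&lt;br /&gt;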
&lt;br /&gt;
In order to improve cache coherence performance over the years, several protocols have been proposed.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, the three states that a cache line can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to main memory. The invalid state means that the line holds no usable copy, for example because the variable has been modified by another cache; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback to this protocol occurs when a single processor wants to read a block and then write to it without any other processor sharing that block. After the read, a bus transaction places the block into the shared state. The write then occurs and another bus transaction is sent to invalidate the shared copies. This second transaction is useless as no other processors are sharing the block, but the MSI protocol has no way to express this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same for this protocol as they are for the MSI protocol. This protocol introduces a new state: the exclusive state. The exclusive state means that the variable is in only this cache and its value matches the value in main memory. A cache holding a line in the exclusive state can therefore write to it without a bus transaction, which removes the redundant invalidation that MSI requires. The shared state now indicates that the variable may be held in more than one cache.&lt;br /&gt;
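&lt;br /&gt;
The read-then-write pattern described for MSI can be compared across the two protocols with a small state walk. This is a minimal sketch under simplifying assumptions (one block, one requesting cache, a hypothetical function name read_then_write), not a full protocol implementation.&lt;br /&gt;

```python
def read_then_write(protocol, other_sharers=False):
    """Return (final_state, bus_transactions) for a read followed by a write.

    protocol      -- "MSI" or "MESI"
    other_sharers -- True if another cache already holds the block
    """
    state, bus = "I", 0

    # Read miss: the block is fetched over the bus.
    bus += 1
    if protocol == "MESI" and not other_sharers:
        state = "E"      # only copy in the system, still clean
    else:
        state = "S"      # MSI cannot tell whether anyone else shares the block

    # Write hit.
    if state == "S":
        bus += 1         # announce the write so that other copies are invalidated
    state = "M"          # from E, the upgrade to M needs no bus transaction
    return state, bus


print(read_then_write("MSI"))                       # ('M', 2)
print(read_then_write("MESI"))                      # ('M', 1)
print(read_then_write("MESI", other_sharers=True))  # ('M', 2)
```

When no other cache shares the block, MESI saves the second bus transaction; when the block is shared, both protocols behave the same.&lt;br /&gt;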
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least it will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and receives it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single processor system, code will execute correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. But in a shared memory model with multiple processors, two threads can access the same shared data element, such as a synchronization variable, and the output of the threads can change depending on which thread accesses the shared data element first. When this happens, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes at the cost of performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially. &lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, a read of a particular location is expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
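&lt;br /&gt;
Taken together, these two expectations are usually called sequential consistency. As a hedged illustration, the sketch below enumerates every interleaving of a classic two-processor test (P1 stores x then loads y, P2 stores y then loads x) under exactly those two rules. The helper names interleavings and run are hypothetical and chosen only for this example.&lt;br /&gt;

```python
# Program order for each processor: (processor, operation, variable)
P1 = [("P1", "store", "x"), ("P1", "load", "y")]
P2 = [("P2", "store", "y"), ("P2", "load", "x")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule, one atomic operation at a time."""
    memory = {"x": 0, "y": 0}
    loaded = {}
    for proc, op, var in schedule:
        if op == "store":
            memory[var] = 1
        else:
            loaded[proc] = memory[var]
    return loaded["P1"], loaded["P2"]


# Each outcome is (value P1 loaded from y, value P2 loaded from x).
outcomes = {run(s) for s in interleavings(P1, P2)}
print(sorted(outcomes))   # [(0, 1), (1, 0), (1, 1)]
```

The outcome (0, 0) never appears under these rules. Hardware that lets a load bypass an earlier buffered store can produce it, which is one way relaxed models differ from the programmer's expectations.&lt;br /&gt;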
&lt;br /&gt;
A strong consistency model that attempts uniprocessor-like consistency can create a global bottleneck and cost performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The following table lists various consistency models. The programmer must take the system's memory consistency model into account to create correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; class=&amp;quot;unsortable&amp;quot; | Type of Consistency Model&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||       3             &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                3                                        &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            1                                        &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                            3                                      &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency ]||                         2                                    &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                      1                                    &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    1                &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                         &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                  3                                   &lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                2                                      &lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                               3                                        &lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                   3                                          &lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                   3                                    &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in a multicache system. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol in the multicache hardware, so that the time delay of conflicting writes to a memory location is small.&lt;br /&gt;
&lt;br /&gt;
In contrast, in a shared virtual memory on a loosely coupled multiprocessor, which has no physically shared memory and a nontrivial communication cost between processors, conflicts are not likely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, there are two design choices that greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are the centralized manager and the distributed manager. In the centralized system, one processor is designated the &amp;quot;monitor&amp;quot;. This processor maintains a list of information for each page. The list includes the owner of the page (the processor that accessed it last) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to processors that have copies of the page. This differs from a bus-based system, in which the invalidation message is broadcast to all processors. A drawback of the centralized manager is the bottleneck at the monitor processor. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
&lt;br /&gt;
The distributed manager is similar to a centralized manager, but instead of one processor monitoring all pages, a subset of the pages is given to each processor. So, processor 0 would only monitor pages 1 through i, and processor 1 would only monitor pages i+1 through n.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
[[Image:Centralized.png|thumbnail|right|600px|Example of a centralized system]]&lt;br /&gt;
[[Image:Distributed.png|thumbnail|right|600px|Example of a distributed system]]&lt;br /&gt;
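&lt;br /&gt;
The bookkeeping performed by a manager can be sketched as follows. This is an illustrative model only: the class Manager, the function manager_index and the contiguous block mapping of pages to managers are assumptions made for the example, and the owner is simplified to the processor that last wrote the page.&lt;br /&gt;

```python
class Manager:
    """Tracks the owner and the copy set of each page it is responsible for."""

    def __init__(self):
        self.owner = {}    # page -> processor that last wrote it (simplification)
        self.copies = {}   # page -> set of processors holding a copy

    def read(self, page, proc):
        """Record that proc now holds a copy of page."""
        self.copies.setdefault(page, set()).add(proc)

    def write(self, page, proc):
        """Record a write and return the processors that must be invalidated."""
        targets = self.copies.get(page, set()) - {proc}
        self.owner[page] = proc
        self.copies[page] = {proc}
        return sorted(targets)


def manager_index(page, n_pages, n_managers):
    """Distributed case: map pages 1..n_pages onto managers in contiguous blocks."""
    block = -(-n_pages // n_managers)   # ceiling division
    return (page - 1) // block


# Centralized: a single manager handles every page.
central = Manager()
for proc in (0, 1, 2):
    central.read(7, proc)
print(central.write(7, 2))   # [0, 1], only the processors that hold copies

# Distributed: with 8 pages and 2 managers, pages 1-4 go to 0 and pages 5-8 to 1.
print(manager_index(3, 8, 2), manager_index(6, 8, 2))   # 0 1
```

Only the processors in the copy set receive an invalidation, and in the distributed case each request goes to the manager responsible for that page rather than to one central processor.&lt;br /&gt;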
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented without any hardware support: Peterson's algorithm is one such software solution. The next section discusses hardware support for mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks such that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, modern processors provide some form of atomic read-modify-write primitive such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC] for the same purpose.&lt;br /&gt;
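&lt;br /&gt;
How a mutex can be built from test-and-set is sketched below. Python has no hardware test-and-set instruction, so the class AtomicFlag is a software stand-in that emulates the atomicity with threading.Lock; the class names, thread count and iteration count are assumptions made for the example.&lt;br /&gt;

```python
import threading
import time


class AtomicFlag:
    """Software stand-in for a hardware test-and-set instruction.

    A real processor reads and writes the flag in one indivisible
    instruction; here a threading.Lock emulates that indivisibility.
    """

    def __init__(self):
        self._value = False
        self._guard = threading.Lock()

    def test_and_set(self):
        """Set the flag to True and return the value it had before."""
        with self._guard:
            old = self._value
            self._value = True
            return old

    def clear(self):
        with self._guard:
            self._value = False


class SpinLock:
    """Mutex that spins on test-and-set until it observes the flag as free."""

    def __init__(self):
        self._flag = AtomicFlag()

    def acquire(self):
        while self._flag.test_and_set():   # old value True: another thread holds it
            time.sleep(0)                  # yield so that the holder can progress

    def release(self):
        self._flag.clear()


counter = 0
lock = SpinLock()


def worker(iterations):
    global counter
    for _ in range(iterations):
        lock.acquire()
        counter += 1      # critical section
        lock.release()


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 4000
```

Because only one thread at a time can see the old flag value as False, the increments never overlap and no update is lost.&lt;br /&gt;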
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt;http://dl.acm.org/citation.cfm?id=75105&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Distributed.png&amp;diff=60019</id>
		<title>File:Distributed.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Distributed.png&amp;diff=60019"/>
		<updated>2012-03-19T17:50:50Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: uploaded a new version of &amp;amp;quot;File:Distributed.png&amp;amp;quot;: Example of a distributed SVM coherence system.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Centralized.png&amp;diff=60017</id>
		<title>File:Centralized.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Centralized.png&amp;diff=60017"/>
		<updated>2012-03-19T17:50:23Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: uploaded a new version of &amp;amp;quot;File:Centralized.png&amp;amp;quot;: Example of a centralized SVM coherence system.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Example of a centralized SVM coherence system&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60014</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60014"/>
		<updated>2012-03-19T17:48:43Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Coherence and Shared Virtual Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers continues to change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely raises exciting and challenging problems. The three main areas being studied today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that every cached copy of the same data holds the same value. If it cannot keep all the copies identical, it needs to flag the out-of-date cache lines so that they are not used.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. Then, if P2 reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The write policy only controls how a changed value is propagated from a cache to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory. &lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as the update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; further writes by the same core generate no bus traffic until another core loads the variable again. This is why invalidation is the preferred method.&lt;br /&gt;
&lt;br /&gt;
In order to improve cache coherence performance over the years, several protocols have been proposed.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, the three states that a cache line can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to main memory. The invalid state means that the line holds no usable copy, for example because the variable has been modified by another cache; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback to this protocol occurs when a single processor wants to read a block and then write to it without any other processor sharing that block. After the read, a bus transaction places the block into the shared state. The write then occurs and another bus transaction is sent to invalidate the shared copies. This second transaction is useless as no other processors are sharing the block, but the MSI protocol has no way to express this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same for this protocol as they are for the MSI protocol. This protocol introduces a new state: the exclusive state. The exclusive state means that the variable is in only this cache and its value matches the value in main memory. A cache holding a line in the exclusive state can therefore write to it without a bus transaction, which removes the redundant invalidation that MSI requires. The shared state now indicates that the variable may be held in more than one cache.&lt;br /&gt;
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least it will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and receives it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single processor system, code will execute correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. But in a shared memory model with multiple processors, two threads can access the same shared data element, such as a synchronization variable, and the output of the threads can change depending on which thread accesses the shared data element first. When this happens, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes at the cost of performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially. &lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, a read of a particular location is expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
&lt;br /&gt;
A strong consistency model that attempts uniprocessor-like consistency can create a global bottleneck and cost performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The following table lists various consistency models. The programmer must take the system's memory consistency model into account to create correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; class=&amp;quot;unsortable&amp;quot; | Type of Consistency Model&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||       3             &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                3                                        &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            1                                        &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                            3                                      &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency ]||                         2                                    &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                      1                                    &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    1                &lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                         &lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                  3                                   &lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                2                                      &lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                               3                                        &lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                   3                                          &lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                   3                                    &lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from the one in a multicache system. In a multicache multiprocessor, the processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol in the multicache hardware, so that the delay caused by conflicting writes to a memory location is small.&lt;br /&gt;
&lt;br /&gt;
In contrast, a shared virtual memory on a loosely coupled multiprocessor has no physically shared memory, and communication between processors has a nontrivial cost. Conflicts are therefore unlikely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers, and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are the centralized manager and the distributed manager. In the centralized system, one processor is designated the &amp;quot;monitor&amp;quot;. This processor keeps a list of information for each page in the cache. The list includes the owner of the page (the processor that accessed it last) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to the processors that have copies of the page. This differs from a bus-based system, where the invalidation message is broadcast to all processors. A drawback of the centralized manager is that the monitor processor becomes a bottleneck. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
[[File:Centralized.png]]&lt;br /&gt;
The distributed manager is similar to a centralized manager, but instead of one processor monitoring all pages, a subset of the pages is given to each processor. So, processor 0 would only monitor pages 1 through i, and processor 1 would only monitor pages i+1 through n.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Distributed.png]]&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented in software, without requiring any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, every modern processor relies on some form of read-modify-write atomic instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
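&lt;br /&gt;
As a rough illustration of how a test-and-set instruction can be used to build a mutex, the following Python sketch emulates the instruction in software. The class names TestAndSetFlag and SpinLock are hypothetical, and the internal threading.Lock only stands in for the atomicity that real hardware would provide; this is not how any particular processor implements it.&lt;br /&gt;

```python
import threading
import time


class TestAndSetFlag:
    """Software stand-in (an assumption) for a word with atomic test-and-set."""

    def __init__(self):
        self._set = False
        self._guard = threading.Lock()  # emulates the atomicity of the instruction

    def test_and_set(self):
        # Atomically return the old value and leave the flag set.
        with self._guard:
            old = self._set
            self._set = True
            return old

    def clear(self):
        self._set = False


class SpinLock:
    """Mutex built only from test_and_set and clear."""

    def __init__(self):
        self._flag = TestAndSetFlag()

    def acquire(self):
        # Spin until the old value was False, meaning this caller set the flag.
        while self._flag.test_and_set():
            time.sleep(0)  # yield to other threads while spinning

    def release(self):
        self._flag.clear()


lock = SpinLock()
total = 0


def worker():
    global total
    for _ in range(500):
        lock.acquire()
        total += 1  # critical section
        lock.release()


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)  # 2000
```

Every thread that loses the race keeps spinning on test_and_set, which is the pessimistic style of locking described above.&lt;br /&gt;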
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt;http://dl.acm.org/citation.cfm?id=75105&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Distributed.png&amp;diff=60013</id>
		<title>File:Distributed.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Distributed.png&amp;diff=60013"/>
		<updated>2012-03-19T17:48:32Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: uploaded a new version of &amp;amp;quot;File:Distributed.png&amp;amp;quot;: Exampled of a distributed SVM coherence system.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Centralized.png&amp;diff=60010</id>
		<title>File:Centralized.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Centralized.png&amp;diff=60010"/>
		<updated>2012-03-19T17:47:16Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: Example of a centralized SVM coherence system&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Example of a centralized SVM coherence system&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60002</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60002"/>
		<updated>2012-03-19T17:36:00Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Cache Coherence Problem */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor system] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is undergoing a continuous change. Parallel computers, which started as high-end super-computing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. In order to design parallel architectures to meet programmer's needs and expectations more closely, exciting and challenging changes exist. The three main areas which are being considered by scientists today are: [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all the copies are the same, it needs to flag the cache line to indicate that it is not up to date.&lt;br /&gt;
&lt;br /&gt;
The figure shown here depicts a 4-processor shared memory system where each processor has its own cache.  Suppose processor P1 reads memory location M1 and stores it in its local cache. Then, if P2 reads the same memory location, M1 also gets stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a change to a cached value is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as the update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; this is why the invalidation method is the preferred method.&lt;br /&gt;
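&lt;br /&gt;
The difference in bus traffic can be made concrete with a toy model. The sketch below is only an illustration under simplifying assumptions: core 0 writes the variable x repeatedly, core 1 starts with a valid copy and never re-reads it, and memory is written through. The function name run and the dictionary layout are hypothetical.&lt;br /&gt;

```python
def run(protocol, writes):
    """Count bus messages while core 0 writes x and core 1 holds a copy of x."""
    memory = {"x": 0}
    core1_copy = {"value": 0, "valid": True}
    bus_messages = 0
    for value in writes:
        if protocol == "update":
            # Update: every write broadcasts the new value, core 1 snoops it.
            bus_messages += 1
            core1_copy["value"] = value
        elif core1_copy["valid"]:
            # Invalidate: a message is needed only while another valid copy exists.
            bus_messages += 1
            core1_copy["valid"] = False
        memory["x"] = value  # write-through, to keep the sketch simple
    return bus_messages


print(run("update", range(1, 101)))      # 100
print(run("invalidate", range(1, 101)))  # 1
```

If core 1 re-read x between the writes, its copy would become valid again and the invalidate count would rise accordingly; the model deliberately leaves that case out.&lt;br /&gt;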
&lt;br /&gt;
In order to improve cache coherence performance over the years, several protocols have been proposed.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback to this protocol occurs when a single processor wants to read a block and then write to it without any other processor sharing that block. After reading the block, a bus transaction places the block into a shared state. The write then occurs and another bus transaction is sent to invalidate the shared copy. This second transaction is useless as no other processors are sharing the block, but the MSI protocol has no way to specify this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same for this protocol as they are for the MSI protocol. This protocol introduces a new state: the exclusive state. The exclusive state means that the variable is in only this cache and its value matches the value in main memory. As a result, the shared state now indicates that the variable is contained in more than one cache.&lt;br /&gt;
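&lt;br /&gt;
The benefit of the exclusive state for the read-then-write case described under MSI can be sketched by counting bus transactions. This is a simplified model, not a full protocol implementation; the function name and the other_sharers flag are hypothetical, and the one-letter state names follow the protocol names.&lt;br /&gt;

```python
def transactions_for_read_then_write(protocol, other_sharers):
    """Count bus transactions when one cache reads a block and then writes it."""
    count = 1  # the initial read miss always needs one bus read
    if protocol == "MESI" and not other_sharers:
        state = "E"  # no other cache holds the block, so it is loaded exclusive
    else:
        state = "S"  # MSI has no way to record that the block is unshared
    if state == "S":
        count += 1  # the write must announce an invalidation on the bus
    # In state E the write upgrades the line to M without using the bus.
    return count


print(transactions_for_read_then_write("MSI", other_sharers=False))   # 2
print(transactions_for_read_then_write("MESI", other_sharers=False))  # 1
print(transactions_for_read_then_write("MESI", other_sharers=True))   # 2
```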
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least it will decide if it will provide it when asked). This is useful because another cache does not have to read the value from main memory and instead receives it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations.  In a single processor system, code will execute correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables.  But in a shared memory model with multiple processors, two threads could access a shared data element, such as a synchronization variable, and the output of the threads would change based on which thread accesses the shared data element first.  If this were to occur, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes at a performance cost.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single processor system, maintaining memory consistency only requires that the compiler preserve the program order when accessing synchronization variables. But in a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations will occur one at a time in the sequential order specified by the program. Thus, because of the sequential program execution, one can expect a read of a particular location to return the value of the last write to it. It is sufficient to only maintain uniprocessor data and control dependences.  The compiler and hardware can freely reorder operations to different locations if the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations, such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
&lt;br /&gt;
A strong consistency model attempting uniprocessor-like consistency could cause a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs.&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs.&lt;br /&gt;
&lt;br /&gt;
The following table lists the various consistency models. It is the programmer who must take the memory consistency model into account to create correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; class=&amp;quot;unsortable&amp;quot; | Type of Consistency Model&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Requires Programmer annotation&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||       3              ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                3                                        ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            1                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                            3                                      ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency ]||                         2                                    ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                                                          ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    1                ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                          ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                                                     ||&lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                                                      ||&lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                               3                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                                                             ||&lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                                                       ||&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from the one in a multicache system. In a multicache multiprocessor, the processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol in the multicache hardware, so that the delay caused by conflicting writes to a memory location is small.&lt;br /&gt;
&lt;br /&gt;
In contrast, a shared virtual memory on a loosely coupled multiprocessor has no physically shared memory, and communication between processors has a nontrivial cost. Conflicts are therefore unlikely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers, and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are the centralized manager and the distributed manager. In the centralized system, one processor is designated the &amp;quot;monitor&amp;quot;. This processor keeps a list of information for each page in the cache. The list includes the owner of the page (the processor that accessed it last) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to the processors that have copies of the page. This differs from a bus-based system, where the invalidation message is broadcast to all processors. A drawback of the centralized manager is that the monitor processor becomes a bottleneck. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
&lt;br /&gt;
The distributed manager is similar to a centralized manager, but instead of one processor monitoring all pages, a subset of the pages is given to each processor. So, processor 0 would only monitor pages 1 through i, and processor 1 would only monitor pages i+1 through n.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
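&lt;br /&gt;
The bookkeeping done by a manager can be sketched as follows. This is a minimal model under stated assumptions: the class PageManager, its method names, and the fixed-size block mapping in manager_for are all hypothetical, and ownership is simplified to the first accessor or the latest writer.&lt;br /&gt;

```python
class PageManager:
    """Tracks the owner and the copy set of every page it is responsible for."""

    def __init__(self):
        self.owner = {}     # page number mapped to the owning processor
        self.copy_set = {}  # page number mapped to the processors holding a copy

    def read(self, page, proc):
        # A reader joins the copy set; the first accessor becomes the owner.
        self.owner.setdefault(page, proc)
        self.copy_set.setdefault(page, set()).add(proc)

    def write(self, page, proc):
        # Only processors in the copy set receive an invalidation message.
        targets = self.copy_set.get(page, set()) - {proc}
        self.owner[page] = proc
        self.copy_set[page] = {proc}
        return sorted(targets)


def manager_for(page, pages_per_manager):
    # Distributed manager: pages 1..i go to processor 0, the next i to processor 1, ...
    return (page - 1) // pages_per_manager


manager = PageManager()
for reader in (0, 2, 3):
    manager.read(7, reader)
print(manager.write(7, 2))  # [0, 3]
print(manager_for(4, 4), manager_for(5, 4))  # 0 1
```

With a centralized manager a single PageManager instance serves every page; with a distributed manager each processor would hold its own instance and manager_for selects which one to ask.&lt;br /&gt;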
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented in software, without requiring any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
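&lt;br /&gt;
One way to see why Peterson's algorithm guarantees mutual exclusion for two processes is to enumerate every interleaving of its steps. The sketch below does this with a small breadth-first search. It assumes that each step is atomic and that memory is sequentially consistent; the program-counter encoding (0 to 3) and the function names are hypothetical.&lt;br /&gt;

```python
from collections import deque


def step(state, i):
    """Advance process i by one atomic step of Peterson's algorithm."""
    pc, flag, turn = list(state[0]), list(state[1]), state[2]
    j = 1 - i
    if pc[i] == 0:    # flag[i] = True: announce interest
        flag[i] = True
        pc[i] = 1
    elif pc[i] == 1:  # turn = j: give priority to the other process
        turn = j
        pc[i] = 2
    elif pc[i] == 2:  # busy-wait while flag[j] and turn == j
        if not (flag[j] and turn == j):
            pc[i] = 3  # enter the critical section
    else:             # pc == 3: leave the critical section
        flag[i] = False
        pc[i] = 0
    return (tuple(pc), tuple(flag), turn)


def explore():
    """Return every state reachable under any interleaving of the two processes."""
    start = ((0, 0), (False, False), 0)
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for i in (0, 1):
            nxt = step(state, i)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


states = explore()
violations = [s for s in states if s[0] == (3, 3)]
print(len(violations))  # 0
```

No reachable state has both processes at step 3, so the two are never in the critical section together. On hardware that reorders memory operations, the same algorithm would additionally need memory fences to keep this property.&lt;br /&gt;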
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, every modern processor relies on some form of read-modify-write atomic instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
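&lt;br /&gt;
The optimistic approach with collision detection is typically written as a compare-and-swap retry loop. The Python sketch below emulates the instruction in software: the class AtomicInt and its compare_and_swap method are hypothetical, and the internal threading.Lock merely stands in for the atomicity that the hardware instruction would provide.&lt;br /&gt;

```python
import threading


class AtomicInt:
    """Software stand-in (an assumption) for a word supporting compare-and-swap."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # emulates the atomicity of the instruction

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Store new only if the current value still equals expected.
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def increment(counter):
    # Optimistic update: compute from a snapshot, retry if another thread interfered.
    while True:
        old = counter.load()
        if counter.compare_and_swap(old, old + 1):
            return


counter = AtomicInt()


def worker():
    for _ in range(1000):
        increment(counter)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.load())  # 4000
```

A failed compare_and_swap is the collision detection step: the thread learns that the value changed since its snapshot and simply tries again, without ever holding a lock across the whole update.&lt;br /&gt;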
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt;http://dl.acm.org/citation.cfm?id=75105&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60001</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60001"/>
		<updated>2012-03-19T17:35:40Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Cache Coherence Protocols */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor system] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is undergoing a continuous change. Parallel computers, which started as high-end super-computing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. In order to design parallel architectures to meet programmer's needs and expectations more closely, exciting and challenging changes exist. The three main areas which are being considered by scientists today are: [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all the copies are the same, it needs to flag the cache line to indicate that it is not up to date.&lt;br /&gt;
&lt;br /&gt;
The figure shown here depicts a 4-processor shared memory system where each processor has its own cache.  Suppose processor P1 reads memory location M1 and stores it in its local cache. Then, if P2 reads the same memory location, M1 also gets stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a change to a cached value is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as the update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; this is why the invalidation method is the preferred method.&lt;br /&gt;
&lt;br /&gt;
In order to improve cache coherence performance over the years, several protocols have been proposed.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback of this protocol shows up when a single processor reads a block and then writes to it while no other processor shares that block. The read causes a bus transaction that places the block in the shared state. The write then requires a second bus transaction to invalidate any other shared copies. This second transaction is useless, because no other processor holds the block, but the MSI protocol has no way to express that.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same as in the MSI protocol. The protocol introduces one new state: the exclusive state. The exclusive state means that the variable is in this cache only and that its value matches the value in main memory. As a result, the shared state now indicates that the variable may be held in more than one cache. A write to an exclusive line can move it to the modified state without a bus transaction, which removes the redundant invalidation described for MSI.&lt;br /&gt;
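&lt;br /&gt;
The effect of the exclusive state on the read-then-write pattern discussed under MSI can be sketched as follows. This is a simplified count under stated assumptions: the line starts invalid in the requesting cache, and the function name read_then_write and the one-letter state codes are illustrative choices, not part of either protocol definition.&lt;br /&gt;
&lt;br /&gt;
```python
def read_then_write(protocol, other_sharers):
    """Return (bus_transactions, final_state) for one read, then one write.

    protocol is "MSI" or "MESI"; other_sharers says whether any other
    cache holds the block at the time of the read.
    """
    if protocol not in ("MSI", "MESI"):
        raise ValueError("unknown protocol: " + protocol)

    transactions = 0

    # Read miss: the block must be fetched over the bus in both protocols.
    transactions += 1
    if protocol == "MESI" and not other_sharers:
        state = "E"   # sole clean copy
    else:
        state = "S"   # MSI cannot tell whether anyone else has it

    # Write: from S an invalidation must go on the bus; from E it is silent.
    if state == "S":
        transactions += 1
    state = "M"
    return transactions, state


for protocol in ("MSI", "MESI"):
    print(protocol, read_then_write(protocol, other_sharers=False))
```
&lt;br /&gt;
With no other sharers, the sketch counts two bus transactions for MSI and one for MESI; when other caches do share the block, both protocols need the invalidation.&lt;br /&gt;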
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and can receive it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol combines the MESI and MOSI protocols: it has both the exclusive state and the owned state.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations.  In a single-processor system, code executes correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables.  In a shared-memory model with multiple processors, however, two threads can access a shared data element such as a synchronization variable, and the output of the threads then depends on which thread accesses the shared data element first.  When this happens, the program output may not be the expected value. Maintaining program order is very important for memory consistency, but it comes at the cost of performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially. &lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, a read of a particular location can be expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences.  The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
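&lt;br /&gt;
Taken together, these two expectations correspond to what is usually called sequential consistency. As a small illustration, the sketch below enumerates every interleaving of two hypothetical two-instruction threads (P0 stores x then loads y; P1 stores y then loads x) and collects the register values that can result. The program, the names P0 and P1, and the helper functions are assumptions made for this example.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import combinations

# Each operation is (kind, variable, destination register or None).
P0 = [("store", "x", None), ("load", "y", "r0")]
P1 = [("store", "y", None), ("load", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list in its own order."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        merged, ia, ib = [], 0, 0
        for position in range(total):
            if position in slots:
                merged.append(a[ia])
                ia += 1
            else:
                merged.append(b[ib])
                ib += 1
        yield merged


def execute(schedule):
    """Run one interleaving atomically, one operation at a time."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for kind, variable, register in schedule:
        if kind == "store":
            memory[variable] = 1
        else:
            registers[register] = memory[variable]
    return registers["r0"], registers["r1"]


outcomes = sorted({execute(s) for s in interleavings(P0, P1)})
print(outcomes)
```
&lt;br /&gt;
The outcome in which both loads return 0 never appears in this enumeration, because it would require each load to run before the other thread's store while program order is kept. Hardware or compilers that reorder the store and the load within a thread can produce it, which is why weaker models need extra care from the programmer.&lt;br /&gt;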
&lt;br /&gt;
A strong consistency model that attempts uniprocessor-like consistency could cause a global bottleneck and cost performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The table below lists various consistency models. The programmer must take the memory consistency model into account to write correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; class=&amp;quot;unsortable&amp;quot; | Type of Consistency Model&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Requires programmer annotation&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||       3              ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                3                                        ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            1                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                            3                                      ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency]||                         2                                    ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                                                          ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    1                ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                          ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                                                     ||&lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                                                      ||&lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                               3                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                                                             ||&lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                                                       ||&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in a multicache system. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol for the multicache hardware, so that the delay caused by conflicting writes to a memory location is small.&lt;br /&gt;
&lt;br /&gt;
In contrast, a shared virtual memory on a loosely coupled multiprocessor has no physically shared memory and a nontrivial communication cost between processors, so conflicts are not likely to be resolved with negligible delay; they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers, and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are the centralized manager and the distributed manager. In the centralized system, one processor is designated the &amp;quot;monitor&amp;quot;. This processor keeps an entry for each page that records the owner of the page (the processor that accessed it last) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to processors that have copies of the page. This differs from a bus-based system, where the invalidation message is broadcast to all processors. A drawback of the centralized manager is the bottleneck at the monitor processor. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
&lt;br /&gt;
The distributed manager is similar to a centralized manager, but instead of one processor monitoring all pages, a subset of the pages is given to each processor. So, processor 0 would only monitor pages 1 through i, and processor 1 would only monitor pages i+1 through n.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
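&lt;br /&gt;
The bookkeeping described for the two manager schemes can be sketched as follows. This is a simplified illustration and not the algorithm from the cited paper: the class PageManager, the function manager_for, zero-based page numbers, and the rule that the first reader becomes the initial owner are all assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
class PageManager:
    """Centralized manager: per page, the owner and the set of copy holders."""

    def __init__(self):
        self.owner = {}      # page number to owning processor
        self.copy_set = {}   # page number to set of processors with a copy

    def read(self, page, processor):
        # Simplification: the first processor to touch a page owns it.
        self.owner.setdefault(page, processor)
        self.copy_set.setdefault(page, set()).add(processor)

    def write(self, page, processor):
        """Make processor the owner; return who must receive an invalidation."""
        holders = self.copy_set.get(page, set())
        to_invalidate = sorted(holders - {processor})
        self.owner[page] = processor
        self.copy_set[page] = {processor}
        return to_invalidate


def manager_for(page, num_pages, num_processors):
    """Distributed variant: each processor manages one contiguous page range."""
    pages_per_manager = -(-num_pages // num_processors)  # ceiling division
    return page // pages_per_manager


manager = PageManager()
for p in (0, 1, 2):
    manager.read(page=7, processor=p)
invalidated = manager.write(page=7, processor=1)
print(invalidated, manager.owner[7])
print([manager_for(page, 8, 2) for page in range(8)])
```
&lt;br /&gt;
Only the processors recorded in the copy set receive the invalidation, in contrast to a bus broadcast. In the distributed variant, the lookup in manager_for replaces the single monitor with a fixed page-to-manager mapping.&lt;br /&gt;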
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented without any special hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
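&lt;br /&gt;
A minimal sketch of Peterson's algorithm for two threads is shown below. It assumes a runtime in which loads and stores to shared variables are not reordered, which holds for standard CPython with its global interpreter lock but not for most hardware without memory fences. The function name run_peterson and the shared counter are illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time


def run_peterson(iterations):
    """Two threads increment a shared counter under Peterson's algorithm."""
    flag = [False, False]   # flag[i] is True while thread i wants to enter
    turn = [0]              # the thread that must wait if both want to enter
    counter = [0]

    def worker(me):
        other = 1 - me
        for _ in range(iterations):
            # Entry protocol: announce intent, then give way to the other.
            flag[me] = True
            turn[0] = other
            while flag[other] and turn[0] == other:
                time.sleep(0)   # yield so the other thread can progress
            # Critical section: a read-modify-write that is not atomic.
            counter[0] = counter[0] + 1
            # Exit protocol.
            flag[me] = False

    threads = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


total = run_peterson(500)
print(total)
```
&lt;br /&gt;
The algorithm uses only ordinary loads and stores, which is what makes it a software solution. On processors that reorder a store with a later load, the same code would need memory barriers to remain correct.&lt;br /&gt;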
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks such that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, modern processors rely on some form of read-modify-write atomic instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
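&lt;br /&gt;
The optimistic pattern built on compare-and-swap can be sketched as a retry loop. Python has no compare-and-swap instruction, so the class AtomicInt below emulates one with an internal lock; the class, its method names, and run_cas_counter are assumptions for illustration and stand in for what the hardware would do in a single instruction.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


class AtomicInt:
    """Software stand-in for a memory word that supports compare-and-swap."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()   # emulates the instruction's atomicity

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store new only if the word still holds expected; report success."""
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True


def optimistic_increment(cell):
    """Read, compute, then publish only if nobody interfered; else retry."""
    while True:
        seen = cell.load()
        if cell.compare_and_swap(seen, seen + 1):
            return


def run_cas_counter(num_threads, per_thread):
    cell = AtomicInt()

    def worker():
        for _ in range(per_thread):
            optimistic_increment(cell)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load()


print(run_cas_counter(4, 1000))
```
&lt;br /&gt;
A failed compare_and_swap is the collision detection step: the thread learns that another update happened in between and simply retries, without ever holding a lock across its own computation.&lt;br /&gt;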
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://dl.acm.org/citation.cfm?id=75105&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60000</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=60000"/>
		<updated>2012-03-19T17:35:10Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Coherence and Shared Virtual Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is undergoing continuous change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely raises exciting and challenging problems. The three main areas being studied today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all copies are the same, it needs to flag the stale cache line to indicate that it is not up to date.  &lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache.  Suppose processor P1 reads memory location M1 and stores it in its local cache. If P2 then reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a changed value in a cache is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With invalidation, an invalidation message is sent onto the inter-core bus when a variable is changed. Every other cache that holds a copy marks it invalid, so the next time its core accesses that variable the access results in a cache miss and the variable is read from main memory. &lt;br /&gt;
&lt;br /&gt;
The update method generates a significant amount of traffic on the inter-core bus, because an update message is sent every time the variable is written. The invalidation method only requires an invalidation message the first time a variable is altered, which is why invalidation is generally the preferred method.&lt;br /&gt;
&lt;br /&gt;
Over the years, several protocols have been proposed to improve cache coherence performance.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, the three states that a cache line can be in. The modified state means that the variable in the cache has been modified and therefore differs from the value in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable is unmodified and may also be present in other caches; the cache can evict it without writing it back to main memory. The invalid state means that the cached copy is not usable, for example because another cache has modified the variable; the cache must read a new value from main memory (or from another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback of this protocol shows up when a single processor reads a block and then writes to it while no other processor shares that block. The read causes a bus transaction that places the block in the shared state. The write then requires a second bus transaction to invalidate any other shared copies. This second transaction is useless, because no other processor holds the block, but the MSI protocol has no way to express that.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same as in the MSI protocol. The protocol introduces one new state: the exclusive state. The exclusive state means that the variable is in this cache only and that its value matches the value in main memory. As a result, the shared state now indicates that the variable may be held in more than one cache. A write to an exclusive line can move it to the modified state without a bus transaction, which removes the redundant invalidation described for MSI.&lt;br /&gt;
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and can receive it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol combines the MESI and MOSI protocols: it has both the exclusive state and the owned state.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations.  In a single-processor system, code executes correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables.  In a shared-memory model with multiple processors, however, two threads can access a shared data element such as a synchronization variable, and the output of the threads then depends on which thread accesses the shared data element first.  When this happens, the program output may not be the expected value. Maintaining program order is very important for memory consistency, but it comes at the cost of performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially. &lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, a read of a particular location can be expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences.  The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
&lt;br /&gt;
A strong consistency model that attempts uniprocessor-like consistency could cause a global bottleneck and cost performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The table below lists various consistency models. The programmer must take the memory consistency model into account to write correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; class=&amp;quot;unsortable&amp;quot; | Type of Consistency Model&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Requires programmer annotation&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||       3              ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                3                                        ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            1                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                            3                                      ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency]||                         2                                    ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                                                          ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    1                ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                          ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                                                     ||&lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                                                      ||&lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                               3                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                                                             ||&lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                                                       ||&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in a multicache system. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol for the multicache hardware, so that the delay caused by conflicting writes to a memory location is small.&lt;br /&gt;
&lt;br /&gt;
In contrast, a shared virtual memory on a loosely coupled multiprocessor has no physically shared memory and a nontrivial communication cost between processors, so conflicts are not likely to be resolved with negligible delay; they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers, and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are the centralized manager and the distributed manager. In the centralized system, one processor is designated the &amp;quot;monitor&amp;quot;. This processor keeps an entry for each page that records the owner of the page (the processor that accessed it last) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to processors that have copies of the page. This differs from a bus-based system, where the invalidation message is broadcast to all processors. A drawback of the centralized manager is the bottleneck at the monitor processor. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
&lt;br /&gt;
The distributed manager is similar to a centralized manager, but instead of one processor monitoring all pages, a subset of the pages is given to each processor. So, processor 0 would only monitor pages 1 through i, and processor 1 would only monitor pages i+1 through n.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented without any special hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks such that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, modern processors rely on some form of read-modify-write atomic instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt;http://dl.acm.org/citation.cfm?id=75105&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59999</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59999"/>
		<updated>2012-03-19T17:34:26Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory semantics in Uniprocessor systems */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers continues to change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely remains an exciting and challenging problem. Three main areas of active research are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the coherence protocol must ensure that every cached copy of the data holds the same value. If it cannot keep all copies identical, it must flag the out-of-date cache lines so that they are not used.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. If P2 then reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches: a write made in one cache must become visible to the others. This is known as the write propagation requirement.&lt;br /&gt;
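&lt;br /&gt;
The scenario above can be replayed with a minimal sketch in Python. The &lt;code&gt;Cache&lt;/code&gt; class and its methods are hypothetical names used only for illustration. The caches are assumed to write through to memory but to have no coherence protocol, so nothing tells P2 that its copy is stale.&lt;br /&gt;

```python
class Cache:
    """A private cache with no coherence protocol (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory, modelled as a dict
        self.lines = {}        # locally cached copies

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]               # hit: return the local copy

    def write(self, addr, value):
        self.lines[addr] = value              # update the local copy
        self.memory[addr] = value             # write through to memory


memory = {"M1": 10}
p1, p2 = Cache(memory), Cache(memory)
p1.read("M1")            # P1 caches M1
p2.read("M1")            # P2 caches M1
p1.write("M1", 20)       # P1 changes M1; P2 is not notified
stale = p2.read("M1")    # P2 hits in its own cache and reads the old value
```

Main memory holds the new value after the write, yet P2 still reads the old one, which is why a write policy alone does not give coherence.&lt;br /&gt;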
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The write policy only controls how a change to a cached value is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores later attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as the update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; later writes by the same core cause no further traffic until another cache reads the variable again. This is why invalidation is the preferred method.&lt;br /&gt;
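&lt;br /&gt;
The traffic difference can be made concrete with a small counting sketch. The function name and the access pattern (one core writing the same variable repeatedly while no other core touches it) are assumptions chosen for illustration, not measurements of any real system.&lt;br /&gt;

```python
def bus_messages(protocol, writes):
    """Count bus messages when one core writes the same variable `writes`
    times and no other core accesses it in between (assumed pattern)."""
    if protocol == "update":
        return writes             # every write broadcasts the new value
    if protocol == "invalidate":
        return min(writes, 1)     # only the first write invalidates copies
    raise ValueError(protocol)
```

For 100 consecutive writes this gives 100 messages under update and 1 under invalidate.&lt;br /&gt;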
&lt;br /&gt;
Several protocols have been proposed over the years to improve cache coherence performance.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback of this protocol appears when a single processor reads a block and then writes to it while no other processor shares that block. Reading the block requires a bus transaction, which places the block in the shared state. When the write occurs, another bus transaction is sent to invalidate any other shared copies. This second transaction is useless because no other processor is sharing the block, but the MSI protocol has no way to know this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same as in the MSI protocol. This protocol introduces a new state: the exclusive state. The exclusive state means that the variable is in only this cache and that its value matches the value in main memory. The shared state therefore now indicates that the variable may be contained in more than one cache. Because a line in the exclusive state has no other sharers, it can be written without a second bus transaction, which removes the drawback of MSI described above.&lt;br /&gt;
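&lt;br /&gt;
The read-then-write case that separates MSI from MESI can be sketched as a tiny state walk. This is a simplified model under stated assumptions: the function name is hypothetical, the block starts out absent from the requesting cache, and only bus transactions issued by that cache are counted.&lt;br /&gt;

```python
def read_then_write(protocol, other_sharers):
    """Return (bus_transactions, final_state) for a core that reads a block
    it does not hold and then writes to it."""
    bus = 1                                   # read miss: fetch the block
    if protocol == "MESI" and not other_sharers:
        state = "E"                           # sole clean copy
    else:
        state = "S"                           # MSI cannot tell it is alone
    if state == "S":
        bus += 1                              # upgrade: invalidate other copies
    state = "M"                               # the write makes the line dirty
    return bus, state
```

With no other sharers, MSI needs two bus transactions while MESI needs one; when other caches do hold the block, both protocols need two.&lt;br /&gt;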
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &quot;owns&quot; the variable and will provide the current value to other caches when requested (or at least it will decide if it will provide it when asked). This is useful because another cache does not have to read the value from main memory and receives it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol combines the MESI and MOSI protocols: it has both the exclusive state and the owned state.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single-processor system, code executes correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. In a shared memory model with multiple processors, however, two threads can access a shared data element, such as a synchronization variable, and the output of the threads changes depending on which thread accesses the shared data element first. When this happens, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes at a cost in performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, because of the sequential program execution, a read of a particular location can be expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations.&lt;sup&gt;&lt;span id=&quot;4body&quot;&gt;[[#4foot|[4]]]&lt;/span&gt;&lt;/sup&gt;&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
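&lt;br /&gt;
Taken together, these two expectations are what is usually called sequential consistency. As a hedged illustration, the sketch below enumerates every interleaving of two tiny two-instruction programs that respects program order and executes each access atomically. The programs, variable names and helper functions are assumptions made up for this example.&lt;br /&gt;

```python
# P0: x = 1; r0 = y        P1: y = 1; r1 = x
P0 = [("store", "x", 1), ("load", "y", "r0")]
P1 = [("store", "y", 1), ("load", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each program's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def execute(schedule):
    """Run one schedule with atomic accesses; return (r0, r1)."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for op, loc, arg in schedule:
        if op == "store":
            memory[loc] = arg
        else:
            regs[arg] = memory[loc]
    return regs["r0"], regs["r1"]


outcomes = {execute(s) for s in interleavings(P0, P1)}
```

The result (0, 0) never appears in &lt;code&gt;outcomes&lt;/code&gt;. A weaker model that lets a load complete before an earlier store to a different location could produce it, which is the kind of behavior the programmer must then account for.&lt;br /&gt;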
&lt;br /&gt;
A strong consistency model attempting uniprocessor-like consistency could cause a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The table below lists various consistency models. The programmer must take the memory consistency model into account to create correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&quot;col&quot; class=&quot;unsortable&quot; | Type of Consistency Model&lt;br /&gt;
! scope=&quot;col&quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Requires Programmer annotation&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||       3              ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                3                                        ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            1                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                            3                                      ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency ]||                         2                                    ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                                                          ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    1                ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                          ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                                                     ||&lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                                                      ||&lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                               3                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                                                             ||&lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                                                       ||&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in a multicache system. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol in the multicache hardware, so that the time delay of conflicting writes to a memory location is small.&lt;br /&gt;
&lt;br /&gt;
In contrast, in a shared virtual memory on a loosely coupled multiprocessor, which has no physically shared memory and a nontrivial communication cost between processors, conflicts are not likely to be resolved with negligible delay; they resemble much more a “page fault” in a traditional virtual memory system. Thus, there are two design choices that greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are the centralized manager and the distributed manager. In the centralized scheme, one processor is designated the &quot;monitor&quot;. This processor keeps a record for each page, which includes the owner of the page (the processor that accessed it last) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to the processors that have copies of the page. This differs from a bus-based system, in which the invalidation message is broadcast to all processors. A drawback of the centralized manager is the bottleneck at the monitor processor. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
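&lt;br /&gt;
The bookkeeping done by the monitor can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that ownership moves only when a processor writes a page; it omits the actual page transfer and messaging.&lt;br /&gt;

```python
class CentralManager:
    """Tracks, for every page, its owner and the processors holding copies."""

    def __init__(self):
        self.owner = {}      # page number to owning processor
        self.copyset = {}    # page number to set of processors with a copy

    def read_fault(self, page, proc):
        """Record that proc now holds a read copy of page."""
        self.owner.setdefault(page, proc)
        self.copyset.setdefault(page, set()).add(proc)

    def write_fault(self, page, proc):
        """Make proc the owner; return who must receive an invalidation."""
        targets = self.copyset.get(page, set()) - {proc}
        self.owner[page] = proc
        self.copyset[page] = {proc}
        return targets
```

Only the processors in the returned set are contacted, in contrast to the broadcast used on a bus.&lt;br /&gt;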
&lt;br /&gt;
The distributed manager is similar to a centralized manager, but instead of one processor monitoring all pages, a subset of the pages is given to each processor. So, processor 0 would only monitor pages 1 through i, and processor 1 would only monitor pages i+1 through n.&lt;br /&gt;
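&lt;br /&gt;
One way to realize this split is a fixed block partition of the page numbers. The function below is a sketch under that assumption; the name and parameters are hypothetical, and pages are numbered from 1 as in the text.&lt;br /&gt;

```python
def manager_of(page, num_pages, num_procs):
    """Return the processor that manages `page` (pages are numbered from 1)."""
    per_proc = -(-num_pages // num_procs)     # ceiling division
    return (page - 1) // per_proc
```

With 8 pages and 2 processors, processor 0 manages pages 1 through 4 and processor 1 manages pages 5 through 8.&lt;br /&gt;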
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented in software, without requiring any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
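&lt;br /&gt;
A minimal sketch of Peterson's algorithm for two threads is given below. It assumes that loads and stores become visible in program order, which holds for CPython threads under the global interpreter lock; on hardware with a weaker memory model the algorithm additionally needs memory fences. The function and variable names are chosen for illustration.&lt;br /&gt;

```python
import threading
import time

flag = [False, False]   # flag[i] is True while thread i wants to enter
turn = 0                # the thread that must wait if both want to enter
counter = 0             # shared data protected by the algorithm


def worker(me, iterations):
    global turn, counter
    other = 1 - me
    for _ in range(iterations):
        flag[me] = True                  # announce intent to enter
        turn = other                     # give priority to the other thread
        while flag[other] and turn == other:
            time.sleep(0)                # busy-wait, yielding the CPU
        counter = counter + 1            # critical section
        flag[me] = False                 # leave the critical section


def run_peterson(iterations):
    """Run two workers and return the final value of the shared counter."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(i, iterations))
               for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

If mutual exclusion holds, the returned counter equals twice the iteration count, because no increment is lost.&lt;br /&gt;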
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, most modern processors provide a more general read-modify-write atomic instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
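&lt;br /&gt;
The optimistic pattern built on compare-and-swap can be sketched as a retry loop. Python has no hardware compare-and-swap, so the &lt;code&gt;AtomicInt&lt;/code&gt; class below is a hypothetical stand-in that simulates the instruction's atomicity with an internal lock; only the shape of the retry loop is the point.&lt;br /&gt;

```python
import threading


class AtomicInt:
    """Simulates a memory word that supports compare-and-swap (CAS)."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the current value equals `expected`."""
        with self._guard:
            if self._value != expected:
                return False             # collision detected
            self._value = new
            return True


def atomic_increment(cell):
    """Optimistic update: compute, then retry if another thread interfered."""
    retries = 0
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return retries
        retries += 1
```

No thread ever waits while holding a lock across the whole update; a failed compare-and-swap is the collision detection, and the update is simply recomputed.&lt;br /&gt;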
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt;http://dl.acm.org/citation.cfm?id=75105&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59997</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59997"/>
		<updated>2012-03-19T17:33:23Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Resources */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers continues to change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely remains an exciting and challenging problem. Three main areas of active research are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the coherence protocol must ensure that every cached copy of the data holds the same value. If it cannot keep all copies identical, it must flag the out-of-date cache lines so that they are not used.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. If P2 then reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches: a write made in one cache must become visible to the others. This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The write policy only controls how a change to a cached value is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores later attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as the update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; later writes by the same core cause no further traffic until another cache reads the variable again. This is why invalidation is the preferred method.&lt;br /&gt;
&lt;br /&gt;
Several protocols have been proposed over the years to improve cache coherence performance.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback of this protocol appears when a single processor reads a block and then writes to it while no other processor shares that block. Reading the block requires a bus transaction, which places the block in the shared state. When the write occurs, another bus transaction is sent to invalidate any other shared copies. This second transaction is useless because no other processor is sharing the block, but the MSI protocol has no way to know this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same as in the MSI protocol. This protocol introduces a new state: the exclusive state. The exclusive state means that the variable is in only this cache and that its value matches the value in main memory. The shared state therefore now indicates that the variable may be contained in more than one cache. Because a line in the exclusive state has no other sharers, it can be written without a second bus transaction, which removes the drawback of MSI described above.&lt;br /&gt;
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &quot;owns&quot; the variable and will provide the current value to other caches when requested (or at least it will decide if it will provide it when asked). This is useful because another cache does not have to read the value from main memory and receives it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol combines the MESI and MOSI protocols: it has both the exclusive state and the owned state.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single-processor system, code executes correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. In a shared memory model with multiple processors, however, two threads can access a shared data element, such as a synchronization variable, and the output of the threads changes depending on which thread accesses the shared data element first. When this happens, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes at a cost in performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, because of the sequential program execution, a read of a particular location can be expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations. [4]&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
&lt;br /&gt;
A strong consistency model attempting uniprocessor-like consistency could cause a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The table below lists various consistency models. The programmer must take the memory consistency model into account to create correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&quot;col&quot; class=&quot;unsortable&quot; | Type of Consistency Model&lt;br /&gt;
! scope=&quot;col&quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Requires Programmer annotation&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||       3              ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                3                                        ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            1                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                            3                                      ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency ]||                         2                                    ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                                                          ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    1                ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                          ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                                                     ||&lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                                                      ||&lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                               3                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                                                             ||&lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                                                       ||&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in a multicache system. In a multicache multiprocessor, the processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol for the multicache hardware, so that the delay caused by conflicting writes to a memory location is small.&lt;br /&gt;
&lt;br /&gt;
In contrast, a shared virtual memory on a loosely coupled multiprocessor has no physically shared memory and a nontrivial communication cost between processors. Conflicts are not likely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, there are two design choices that greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers and the data structures used. So ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are centralized manager and distributed manager. In the centralized system, one processor is designated the &amp;quot;monitor&amp;quot;. This processor keeps a list of information for each page in the cache. The list includes the owner of the page (the processor that accessed it last) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to processors that have copies of the page. This differs from a bus-based system, where the invalidation message is broadcast to all processors. A drawback of the centralized manager is the bottleneck at the monitor processor. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
&lt;br /&gt;
The distributed manager is similar to a centralized manager, but instead of one processor monitoring all pages, a subset of the pages is given to each processor. So, processor 0 would only monitor pages 1 through i, and processor 1 would only monitor pages i+1 through n.&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
The next section discusses whether mutual exclusion can be implemented without requiring any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks such that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, every modern processor relies on some form of read-modify-write atomic instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt;http://dl.acm.org/citation.cfm?id=75105&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59996</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59996"/>
		<updated>2012-03-19T17:32:27Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Coherence and Shared Virtual Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor system] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is undergoing continuous change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely poses exciting and challenging problems. The three main areas under consideration today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all the caches hold the same value, it needs to flag the cache line to indicate that it is not up to date.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache.  Suppose processor P1 reads memory location M1 and stores it in its local cache. Then, if P2 reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a change to a cached value is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory. &lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as the update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; this is why the invalidation method is preferred.&lt;br /&gt;
&lt;br /&gt;
In order to improve cache coherence performance over the years, several protocols have been proposed.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback to this protocol occurs when a single processor wants to read blocks and then write to them without another processor sharing that block. After reading the block, a bus transaction places the block into a shared state. The write then occurs and another bus transaction is sent to invalidate the shared copy. This second transaction is useless as no other processors are sharing the block, but the MSI protocol has no way to specify this.&lt;br /&gt;
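&lt;br /&gt;
The read-then-write drawback can be illustrated with a minimal Python sketch of MSI for a single block. The class name MsiBus, its methods, and the way bus transactions are counted are hypothetical and chosen only for illustration; write-backs and the finer transaction types of a real protocol are not modeled.&lt;br /&gt;
&lt;br /&gt;
```python
class MsiBus:
    """Toy model of one memory block tracked by MSI across several caches."""

    def __init__(self, num_caches):
        # Every cache starts without a valid copy of the block.
        self.state = ["I"] * num_caches
        self.bus_transactions = 0

    def read(self, cache):
        if self.state[cache] == "I":
            # Read miss: one bus transaction; a modified copy elsewhere is
            # downgraded to shared (its write-back is not modeled here).
            self.bus_transactions += 1
            self.state = ["S" if s == "M" else s for s in self.state]
            self.state[cache] = "S"

    def write(self, cache):
        if self.state[cache] != "M":
            # MSI cannot tell whether other sharers exist, so it always
            # issues a bus transaction to invalidate possible copies.
            self.bus_transactions += 1
            self.state = ["I"] * len(self.state)
            self.state[cache] = "M"


bus = MsiBus(num_caches=4)
bus.read(0)   # I to S, first bus transaction
bus.write(0)  # S to M, second bus transaction although nobody shares the block
print(bus.state, bus.bus_transactions)
```
&lt;br /&gt;
In this sketch the second transaction is issued even though cache 0 is the only holder of the block, which is the redundant traffic described above.&lt;br /&gt;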
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same for this protocol as they are for the MSI protocol. This protocol introduces a new state: the exclusive state. The exclusive state means that the variable is in only this cache and that its value matches the value in main memory. The shared state now indicates that the variable is contained in more than one cache.&lt;br /&gt;
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least it will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and receives it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations.  In a single processor system, code will execute correctly if the compiler preserves the order of the accesses to synchronization variables and other dependent variables.  But in a shared memory model with multiple processors, two threads could access a shared data element, such as a synchronization variable, and the output of the threads would change based on which thread accesses the shared data element first.  If this were to occur, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes with performance degradation.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single processor system, maintaining memory consistency requires only that the compiler preserve the program order when accessing synchronization variables. In a multiprocessor system, however, the accesses of one processor must appear to all other processors to execute in program order, at least partially. &lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations will occur one at a time in the sequential order specified by the program. Thus, one can expect a read of a particular location to return the value of the last write to it, because of the sequential program execution. It is sufficient to maintain only uniprocessor data and control dependences.  The compiler and hardware can freely reorder operations to different locations if the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations, such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations. [4]&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place according to the program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
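&lt;br /&gt;
Taken together, these two expectations are commonly called sequential consistency. As a small illustration, the Python sketch below enumerates every interleaving of two two-instruction programs that preserves each program order and treats each access as atomic. The program layout and the function name are hypothetical and chosen only for this example.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import permutations

# P1 stores to x and then loads y; P2 stores to y and then loads x.
PROGRAMS = {
    "P1": [("store", "x"), ("load", "y")],
    "P2": [("store", "y"), ("load", "x")],
}


def sequentially_consistent_outcomes():
    ops = [(name, i) for name in PROGRAMS for i in range(2)]
    outcomes = set()
    for order in permutations(ops):
        # Keep only interleavings that preserve each program order.
        if any([op for op in order if op[0] == name] != [(name, 0), (name, 1)]
               for name in PROGRAMS):
            continue
        memory = {"x": 0, "y": 0}
        loaded = {}
        for name, i in order:
            kind, location = PROGRAMS[name][i]
            if kind == "store":
                memory[location] = 1      # each access takes effect atomically
            else:
                loaded[name] = memory[location]
        outcomes.add((loaded["P1"], loaded["P2"]))
    return outcomes


print(sorted(sequentially_consistent_outcomes()))
```
&lt;br /&gt;
The outcome in which both loads return 0 never appears in this enumeration. Hardware that lets a load bypass an earlier buffered store can produce it, which is one way a relaxed model differs from the programmer's implicit expectations.&lt;br /&gt;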
&lt;br /&gt;
A strong consistency model attempting uniprocessor-like consistency could cause a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The following table lists the various consistency models. It is the programmer who must take the memory consistency model into account to create correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; class=&amp;quot;unsortable&amp;quot; | Type of Consistency Model&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Requires Programmer annotation&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||       3              ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                3                                        ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            1                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                            3                                      ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency ]||                         2                                    ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                                                          ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    1                ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                          ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                                                     ||&lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                                                      ||&lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                               3                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                                                             ||&lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                                                       ||&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in a multicache system. In a multicache multiprocessor, the processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol for the multicache hardware, so that the delay caused by conflicting writes to a memory location is small.&lt;br /&gt;
&lt;br /&gt;
In contrast, a shared virtual memory on a loosely coupled multiprocessor has no physically shared memory and a nontrivial communication cost between processors. Conflicts are not likely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, there are two design choices that greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers and the data structures used. So ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are centralized manager and distributed manager. In the centralized system, one processor is designated the &amp;quot;monitor&amp;quot;. This processor keeps a list of information for each page in the cache. The list includes the owner of the page (the processor that accessed it last) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to processors that have copies of the page. This differs from a bus-based system, where the invalidation message is broadcast to all processors. A drawback of the centralized manager is the bottleneck at the monitor processor. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
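&lt;br /&gt;
The bookkeeping done by the monitor can be sketched in a few lines of Python. The class CentralManager, its two fault handlers, and the rule that the first processor to touch a page becomes its owner are assumptions made for this illustration; page transfer and message passing are left out.&lt;br /&gt;
&lt;br /&gt;
```python
class CentralManager:
    """Toy page directory kept by the single monitor processor."""

    def __init__(self):
        self.owner = {}     # page number to owning processor
        self.copy_set = {}  # page number to set of processors holding a copy

    def read_fault(self, page, processor):
        # Assumption: the first processor to touch a page becomes its owner.
        self.owner.setdefault(page, processor)
        self.copy_set.setdefault(page, set()).add(processor)

    def write_fault(self, page, processor):
        # Invalidate only the processors that actually hold a copy.
        targets = self.copy_set.get(page, set()) - {processor}
        self.owner[page] = processor
        self.copy_set[page] = {processor}
        return sorted(targets)


manager = CentralManager()
manager.read_fault(page=7, processor=0)
manager.read_fault(page=7, processor=2)
invalidated = manager.write_fault(page=7, processor=3)
print(invalidated)
```
&lt;br /&gt;
Only processors 0 and 2 receive an invalidation here; a bus-based scheme would have broadcast it to every processor.&lt;br /&gt;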
&lt;br /&gt;
The distributed manager is similar to a centralized manager, but instead of one processor monitoring all pages, a subset of the pages is given to each processor. So, processor 0 would only monitor pages 1 through i, and processor 1 would only monitor pages i+1 through n.&lt;br /&gt;
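&lt;br /&gt;
One simple way to hand out pages is a static block partition, sketched below in Python. The function name manager_of and the choice of a contiguous block partition are assumptions for illustration, and pages are numbered from 0 here rather than from 1.&lt;br /&gt;
&lt;br /&gt;
```python
def manager_of(page, num_pages, num_processors):
    """Return the processor that manages `page` under a block partition."""
    # Ceiling division so every page gets a manager even when the page
    # count is not a multiple of the processor count.
    pages_per_manager = -(-num_pages // num_processors)
    return page // pages_per_manager


# Eight pages (numbered from 0) split across two managing processors.
for page in range(8):
    print(page, manager_of(page, num_pages=8, num_processors=2))
```
&lt;br /&gt;
Because the mapping is a pure function of the page number, any processor can find the responsible manager without consulting a central table.&lt;br /&gt;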
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
The next section discusses whether mutual exclusion can be implemented without requiring any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
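&lt;br /&gt;
A minimal Python sketch of Peterson's algorithm for two threads is shown below. The variable names and the iteration count are chosen for illustration. It assumes that loads and stores are observed in program order, which holds for standard CPython with its global interpreter lock; on hardware with a relaxed memory model the same logic needs memory fences.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time

flag = [False, False]  # flag[i] is True while thread i wants to enter
turn = [0]             # which thread must wait when both want to enter
counter = [0]          # shared data protected by the lock


def worker(me, iterations):
    other = 1 - me
    for _ in range(iterations):
        # Entry protocol: announce interest, then give priority away.
        flag[me] = True
        turn[0] = other
        while flag[other] and turn[0] == other:
            time.sleep(0)  # spin politely until it is safe to enter
        counter[0] += 1    # critical section
        flag[me] = False   # exit protocol


threads = [threading.Thread(target=worker, args=(i, 1000)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])
```
&lt;br /&gt;
The non-atomic increment of the counter is safe only because at most one thread can be between the entry and exit protocols at any time.&lt;br /&gt;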
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks such that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, every modern processor relies on some form of read-modify-write atomic instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
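&lt;br /&gt;
The optimistic pattern built on compare-and-swap can be sketched in Python as follows. Python exposes no hardware compare-and-swap, so the class AtomicCell below is a hypothetical stand-in that emulates the atomicity of the instruction with a lock; only the retry loop in optimistic_increment reflects how such an instruction is typically used.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


class AtomicCell:
    """Software stand-in for a memory word with a compare-and-swap instruction."""

    def __init__(self, value):
        self._value = value
        # The lock only emulates the atomicity that hardware provides.
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Store `new` only if the word still holds `expected`.
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def optimistic_increment(cell):
    # Read, compute, then retry if another thread changed the word meanwhile.
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return


def worker(cell, count):
    for _ in range(count):
        optimistic_increment(cell)


cell = AtomicCell(0)
threads = [threading.Thread(target=worker, args=(cell, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cell.load())
```
&lt;br /&gt;
A failed compare-and-swap is the collision detection step: the update is simply recomputed from the fresh value instead of blocking other threads beforehand.&lt;br /&gt;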
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59993</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59993"/>
		<updated>2012-03-19T17:30:21Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Coherence and Shared Virtual Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor system] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is undergoing continuous change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely poses exciting and challenging problems. The three main areas under consideration today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all the caches hold the same value, it needs to flag the cache line to indicate that it is not up to date.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache.  Suppose processor P1 reads memory location M1 and stores it in its local cache. Then, if P2 reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a change to a cached value is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory. &lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as the update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; this is why the invalidation method is preferred.&lt;br /&gt;
&lt;br /&gt;
In order to improve cache coherence performance over the years, several protocols have been proposed.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback to this protocol occurs when a single processor wants to read blocks and then write to them without another processor sharing that block. After reading the block, a bus transaction places the block into a shared state. The write then occurs and another bus transaction is sent to invalidate the shared copy. This second transaction is useless as no other processors are sharing the block, but the MSI protocol has no way to specify this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same for this protocol as they are for the MSI protocol. This protocol introduces a new state: the exclusive state. The exclusive state means that the variable is in only this cache and its value matches the value in main memory, so the cache can later write to it without a bus transaction. As a result, the shared state now indicates that the variable may be held in more than one cache.&lt;br /&gt;
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &quot;owns&quot; the variable and will provide the current value to other caches when requested (or at least it will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and receives it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations.  In a single-processor system, code will execute correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables.  But in a shared-memory model with multiple processors, two threads could access a shared data element, such as a synchronization variable, and the output of the threads would change based on which thread accesses the shared data element first.  If this were to occur, the program output might not be the value expected. Maintaining program order is very important for memory consistency, but it comes with performance degradation.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. But in a multiprocessor system, the accesses of one processor must appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations will occur one at a time in the sequential order specified by the program. Thus, a read of a particular location can be expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences.  The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations, such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations. [4]&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
&lt;br /&gt;
A strong consistency model attempting uniprocessor-like consistency could cause a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The following table lists various consistency models; it is the programmer who must take the memory consistency model into account to create correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&quot;col&quot; class=&quot;unsortable&quot; | Type of Consistency Model&lt;br /&gt;
! scope=&quot;col&quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Requires Programmer annotation&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||                     ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                8                                        ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            3                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                                                                  ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency ]||                                                             ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                                                          ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    2                ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                          ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                                                     ||&lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                                                      ||&lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                                                                       ||&lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                                                             ||&lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                                                       ||&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in a multicache system. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory enable the use of a sophisticated coherence protocol for the multicache hardware, such that the time delay of conflicting writes to a memory location is small. [5]&lt;br /&gt;
&lt;br /&gt;
In contrast, in a shared virtual memory on a loosely coupled multiprocessor, which has no physically shared memory and a nontrivial communication cost between processors, conflicts are not likely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, there are two design choices that greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are the centralized manager and the distributed manager. In the centralized system, one processor is designated the &quot;monitor&quot;. This processor maintains a list of information for each page. The list includes the owner of the page (the processor that accessed it last) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to processors that have copies of the page. This differs from a bus-based system, where the invalidation message is broadcast to all processors. A drawback of the centralized manager is the bottleneck at the monitor processor. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
&lt;br /&gt;
The distributed manager is similar to a centralized manager, but instead of one processor monitoring all pages, a subset of the pages is given to each processor. So, processor 0 would only monitor pages 1 through i, and processor 1 would only monitor pages i+1 through n.&lt;br /&gt;
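&lt;br /&gt;
The partitioning described above can be sketched as a mapping from page number to managing processor. This is only a minimal illustration: the function name manager_of, the 0-based page numbering, and the split into equal contiguous blocks are assumptions made for the example, not details of a particular system.&lt;br /&gt;
&lt;br /&gt;
```python
def manager_of(page, num_pages, num_procs):
    """Return the processor that manages `page` (hypothetical block partition)."""
    if page not in range(num_pages):
        raise ValueError('page out of range')
    pages_per_proc = -(-num_pages // num_procs)  # ceiling division
    return page // pages_per_proc


# With 8 pages and 2 processors, processor 0 manages pages 0-3
# and processor 1 manages pages 4-7.
print([manager_of(p, 8, 2) for p in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```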
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented without any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks such that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, every modern processor relies on some form of atomic read-modify-write instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59992</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59992"/>
		<updated>2012-03-19T17:28:17Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Coherence and Shared Virtual Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is undergoing continuous change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely raises exciting and challenging problems. The three main areas being considered today are: [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all the caches are the same, it needs to flag a cache line to indicate that it is not up to date.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system where each processor has its own cache.  Suppose processor P1 reads memory location M1 and stores it in its local cache. Then, if P2 reads the same memory location, M1 also gets stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a change to a value in the cache is propagated to a lower-level cache or main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores later attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as an update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; this is why the invalidation method is the preferred method.&lt;br /&gt;
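&lt;br /&gt;
The traffic difference can be made concrete with a toy count of bus messages for one variable. This is a rough sketch under stated assumptions: every core starts with a valid copy, each update, invalidation or miss costs exactly one bus message, and the function name count_bus_messages and the trace format are made up for this illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def count_bus_messages(policy, trace, cores=2):
    """Count bus messages for a trace of ('w' or 'r', core) accesses to one variable."""
    holders = set(range(cores))  # assumption: every core starts with a valid copy
    messages = 0
    for op, core in trace:
        if op == 'w':
            if policy == 'update':
                messages += 1              # new value is broadcast on every write
                holders.add(core)
            else:                          # 'invalidate'
                others = holders - {core}
                if others or core not in holders:
                    messages += 1          # invalidate the other copies once
                holders = {core}           # later writes by this core stay local
        else:                              # 'r'
            if core not in holders:
                messages += 1              # read miss refetches the value
                holders.add(core)
    return messages


# Core 0 writes five times in a row, then core 1 reads once.
trace = [('w', 0)] * 5 + [('r', 1)]
print(count_bus_messages('update', trace))      # 5
print(count_bus_messages('invalidate', trace))  # 2
```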
&lt;br /&gt;
Over the years, several protocols have been proposed to improve cache coherence performance.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback of this protocol appears when a single processor reads a block and then writes to it while no other processor shares that block. The read issues a bus transaction that places the block in the shared state. The write then issues another bus transaction to invalidate the shared copies. This second transaction is useless because no other processor is sharing the block, but the MSI protocol has no way to express this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same for this protocol as they are for the MSI protocol. This protocol introduces a new state: the exclusive state. The exclusive state means that the variable is in only this cache and its value matches the value in main memory, so the cache can later write to it without a bus transaction. As a result, the shared state now indicates that the variable may be held in more than one cache.&lt;br /&gt;
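&lt;br /&gt;
The effect of the exclusive state on the read-then-write case can be sketched with a toy model of a single block in a single cache. This is a simplified illustration, not a full protocol implementation: the function name bus_transactions and the other_sharers flag are assumptions, and only the transitions needed for this case are modeled.&lt;br /&gt;
&lt;br /&gt;
```python
def bus_transactions(protocol, ops, other_sharers=False):
    """Return (final state, bus transaction count) for one block in one cache."""
    state = 'I'   # the block starts out Invalid in this cache
    count = 0
    for op in ops:
        if op == 'read':
            if state == 'I':
                count += 1   # read miss: fetch the block over the bus
                if protocol == 'MESI' and not other_sharers:
                    state = 'E'   # sole clean copy
                else:
                    state = 'S'
        elif op == 'write':
            if state in ('I', 'S'):
                count += 1   # obtain ownership / invalidate other copies
            # from 'E' the upgrade is silent; from 'M' the write simply hits
            state = 'M'
    return state, count


print(bus_transactions('MSI', ['read', 'write']))   # ('M', 2)
print(bus_transactions('MESI', ['read', 'write']))  # ('M', 1)
```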
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &quot;owns&quot; the variable and will provide the current value to other caches when requested (or at least it will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and receives it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations.  In a single-processor system, code will execute correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables.  But in a shared-memory model with multiple processors, two threads could access a shared data element, such as a synchronization variable, and the output of the threads would change based on which thread accesses the shared data element first.  If this were to occur, the program output might not be the value expected. Maintaining program order is very important for memory consistency, but it comes with performance degradation.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. But in a multiprocessor system, the accesses of one processor must appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations will occur one at a time in the sequential order specified by the program. Thus, a read of a particular location can be expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences.  The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations, such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations. [4]&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
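&lt;br /&gt;
These two expectations can be checked on a small example by enumerating every interleaving of two threads in which each memory access is atomic and each thread keeps its program order. The sketch below is illustrative only: the variable names data, flag, r1 and r2 and the helper functions are assumptions for the example. One thread stores data and then flag; the other loads flag into r1 and then data into r2. If program order is kept, the outcome r1 = 1, r2 = 0 never appears; it becomes possible once the two stores are reordered.&lt;br /&gt;
&lt;br /&gt;
```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(p0, p1):
    """Return the set of (r1, r2) values reachable over all interleavings."""
    results = set()
    for order in interleavings(p0, p1):
        mem = {'data': 0, 'flag': 0}
        regs = {}
        for kind, name, arg in order:
            if kind == 'store':
                mem[name] = arg          # arg is the value to store
            else:
                regs[arg] = mem[name]    # arg is the destination register
        results.add((regs['r1'], regs['r2']))
    return results


writer = [('store', 'data', 1), ('store', 'flag', 1)]
reader = [('load', 'flag', 'r1'), ('load', 'data', 'r2')]

in_order = outcomes(writer, reader)
reordered = outcomes(writer[::-1], reader)   # the two stores swapped

print((1, 0) in in_order)    # False
print((1, 0) in reordered)   # True
```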
&lt;br /&gt;
A strong consistency model attempting uniprocessor-like consistency could cause a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The following table lists various consistency models; it is the programmer who must take the memory consistency model into account to create correct software:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ Sortable Table of Memory Consistency Models&lt;br /&gt;
|-&lt;br /&gt;
! scope=&quot;col&quot; class=&quot;unsortable&quot; | Type of Consistency Model&lt;br /&gt;
! scope=&quot;col&quot; | Strictness rank, 1-10 (lowest-highest)&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Requires Programmer annotation&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]||                     ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]||                8                                        ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]||                            3                                        ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Release_consistency release consistency]||                                                                  ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency ]||                                                             ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Delta_consistency delta consistency]||                                                                          ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]||    2                ||&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]||                               1                                          ||&lt;br /&gt;
|- &lt;br /&gt;
|[http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]||                                                     ||&lt;br /&gt;
|-&lt;br /&gt;
| fork consistency||                                                                                                                                                      ||&lt;br /&gt;
|-&lt;br /&gt;
|[http://en.wikipedia.org/wiki/Serializability serializability]||                                                                                       ||&lt;br /&gt;
|- &lt;br /&gt;
|one-copy serializability||                                                                                                                                             ||&lt;br /&gt;
|- &lt;br /&gt;
|entry consistency||                                                                                                                                                       ||&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in a multicache system. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory enable the use of a sophisticated coherence protocol for the multicache hardware, such that the time delay of conflicting writes to a memory location is small. [5]&lt;br /&gt;
&lt;br /&gt;
In contrast, in a shared virtual memory on a loosely coupled multiprocessor, which has no physically shared memory and a nontrivial communication cost between processors, conflicts are not likely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, there are two design choices that greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
Two classes of algorithms for solving the memory coherence problem are the centralized manager and the distributed manager. In the centralized system, one processor is designated the &quot;monitor&quot;. This processor maintains a list of information for each page. The list includes the owner of the page (the processor that accessed it last) and all the processors that have copies of the page. When a page is invalidated, the invalidation message is sent only to processors that have copies of the page. This differs from a bus-based system, where the invalidation message is broadcast to all processors. A drawback of the centralized manager is the bottleneck at the monitor processor. This bottleneck is alleviated by using a distributed manager.&lt;br /&gt;
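&lt;br /&gt;
The bookkeeping kept by the monitor processor can be sketched as a small directory. This is a simplified illustration under assumptions: the class name CentralManager, its method names, and the rule that a writer becomes the owner are chosen for the example and are not taken from a specific implementation.&lt;br /&gt;
&lt;br /&gt;
```python
class CentralManager:
    """Toy per-page directory kept on the single monitor processor."""

    def __init__(self):
        self.owner = {}     # page -> processor currently owning the page
        self.copyset = {}   # page -> set of processors holding a copy

    def read_fault(self, page, proc):
        """Record that `proc` obtained a read copy of `page`."""
        self.owner.setdefault(page, proc)
        self.copyset.setdefault(page, set()).add(proc)

    def write_fault(self, page, proc):
        """Make `proc` the owner; return the processors that must invalidate."""
        targets = self.copyset.get(page, set()) - {proc}
        self.copyset[page] = {proc}
        self.owner[page] = proc
        return sorted(targets)


manager = CentralManager()
manager.read_fault(7, 0)
manager.read_fault(7, 2)
# Only processors 0 and 2 hold page 7, so only they are notified;
# any other processor receives no message.
print(manager.write_fault(7, 1))  # [0, 2]
```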
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented without any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks such that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, every modern processor relies on some form of atomic read-modify-write instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
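&lt;br /&gt;
The optimistic approach with collision detection can be sketched as a compare-and-swap retry loop. Python has no hardware compare-and-swap primitive, so the AtomicInt class below is a hypothetical software stand-in that uses a lock only to emulate the atomicity the hardware instruction would provide; the class, its methods, and the increment helper are assumptions made for the illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


class AtomicInt:
    """Software stand-in for a word updated with hardware compare-and-swap."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()   # emulates the instruction's atomicity

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the current value equals `expected`."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def increment(counter):
    """Optimistic update: read, compute, and retry if another thread interfered."""
    while True:
        seen = counter.load()
        if counter.compare_and_swap(seen, seen + 1):
            return   # no collision detected, the update is complete


def worker(counter, times):
    for _ in range(times):
        increment(counter)


counter = AtomicInt()
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.load())  # 4000
```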
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59985</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59985"/>
		<updated>2012-03-19T17:05:48Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Coherence and Shared Virtual Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Although the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is undergoing continuous change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely presents exciting and challenging problems. The three main areas under study today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all the copies are the same, it needs to flag a cache line to indicate that it is not up to date.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. If P2 then reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. This is known as the write propagation requirement.&lt;br /&gt;
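&lt;br /&gt;
The P1/P2 scenario can be reproduced with a small toy model. This is only a sketch: the class name Cache, its methods, and the values 10 and 99 are assumptions made for illustration, and no coherence protocol is modelled, so the stale read is the expected result.&lt;br /&gt;
```python
# Toy model: each cache is a private dict in front of one shared memory dict.
# No coherence protocol is modelled, so stale copies are expected.

class Cache:
    def __init__(self, memory):
        self.memory = memory   # shared main memory
        self.lines = {}        # private cached copies

    def read(self, addr):
        # On a miss, fetch from main memory; on a hit, use the private copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Write-through: update the private copy and main memory,
        # but nothing notifies the other caches.
        self.lines[addr] = value
        self.memory[addr] = value


memory = {"M1": 10}
p1, p2 = Cache(memory), Cache(memory)

p1.read("M1")          # P1 caches M1
p2.read("M1")          # P2 caches M1
p1.write("M1", 99)     # P1 changes M1

stale = p2.read("M1")  # P2 hits its own cache and still sees the old value
print(memory["M1"], stale)  # prints: 99 10
```
Even with a write-through policy, main memory holds 99 while P2 still reads 10, because nothing propagates the write to P2's cache.&lt;br /&gt;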
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a change to a cached value is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as the update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; this is why the invalidation method is preferred.&lt;br /&gt;
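&lt;br /&gt;
The difference in bus traffic can be sketched by counting messages. The function bus_messages is hypothetical, and the sketch assumes one core writes the same variable repeatedly while no other core reads it in between.&lt;br /&gt;
```python
def bus_messages(policy, writes):
    # Count bus messages when one core writes the same variable repeatedly
    # and no other core touches it in between (an assumption of this sketch).
    messages = 0
    others_have_copy = True
    for _ in range(writes):
        if policy == "update":
            # Every write broadcasts the new value.
            messages += 1
        elif others_have_copy:
            # Only the first write needs to invalidate the other copies.
            messages += 1
            others_have_copy = False
    return messages


print(bus_messages("update", 5), bus_messages("invalidate", 5))  # prints: 5 1
```
If another core re-reads the variable between writes, the invalidate method has to send a new invalidation for the next write, so the gap narrows under heavy sharing.&lt;br /&gt;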
&lt;br /&gt;
In order to improve cache coherence performance over the years, several protocols have been proposed.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback of this protocol appears when a single processor wants to read a block and then write to it while no other processor shares that block. After reading the block, a bus transaction places the block into the shared state. The write then occurs and another bus transaction is sent to invalidate the shared copies. This second transaction is useless because no other processor is sharing the block, but the MSI protocol has no way to express this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same in this protocol as in the MSI protocol. The protocol introduces a new state, the exclusive state, which means that the variable is in only this cache and its value matches the value in main memory. As a result, the shared state now indicates that the variable may be contained in more than one cache.&lt;br /&gt;
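&lt;br /&gt;
The benefit of the exclusive state for the read-then-write case can be sketched as follows. The function read_then_write is a hypothetical helper, and the sketch assumes that no other cache holds the block.&lt;br /&gt;
```python
# Hypothetical helper: counts bus transactions for one core that reads a block
# and then writes it while no other cache holds a copy.

def read_then_write(protocol):
    state = "I"            # the block starts Invalid in this cache
    bus_transactions = 0

    # Read miss: fetch the block over the bus.
    bus_transactions += 1
    # MESI can enter Exclusive because no other cache reported a copy;
    # MSI has only Shared to choose from.
    state = "E" if protocol == "MESI" else "S"

    # Write to the block.
    if state == "S":
        # MSI must broadcast an invalidation even though
        # nobody else holds the block.
        bus_transactions += 1
    # From Exclusive, the move to Modified needs no bus transaction.
    state = "M"

    return state, bus_transactions


for protocol in ("MSI", "MESI"):
    print(protocol, read_then_write(protocol))
# prints: MSI ('M', 2)
#         MESI ('M', 1)
```
Both protocols end in the modified state, but MESI avoids the second bus transaction.&lt;br /&gt;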
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least it will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and receives it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single-processor system, code will execute correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. In a shared memory model with multiple processors, however, two threads can access a shared data element, such as a synchronization variable, and the output of the threads changes depending on which thread accesses the shared data element first. When this occurs, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes at a cost in performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, because of sequential program execution, a read of a particular location can be expected to return the value of the last write to it. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations. [4]&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place according to program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
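&lt;br /&gt;
These two expectations together can be illustrated with a small litmus test. The sketch below uses two hypothetical threads, P1 and P2, and shared variables x and y that start at 0. It enumerates every interleaving that keeps each thread's program order and executes each access atomically.&lt;br /&gt;
```python
# Two hypothetical threads; each is a list of (operation, variable, register).
P1 = [("store", "x", None), ("load", "y", "r1")]
P2 = [("store", "y", None), ("load", "x", "r2")]

def run(schedule):
    # Execute one interleaving; every access is atomic and sees the latest store.
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, var, reg in schedule:
        if op == "store":
            mem[var] = 1
        else:
            regs[reg] = mem[var]
    return regs["r1"], regs["r2"]

def interleavings(a, b):
    # All merges of a and b that keep each thread's own program order.
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

outcomes = {run(s) for s in interleavings(P1, P2)}
print(sorted(outcomes))  # prints: [(0, 1), (1, 0), (1, 1)]
```
The result (0, 0) never appears under these assumptions. Hardware that lets a load bypass an earlier buffered store can produce it, which is one reason the programmer has to know the consistency model in use.&lt;br /&gt;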
&lt;br /&gt;
A strong consistency model attempting uniprocessor-like consistency could cause a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The various consistency models are listed below. It is the programmer who must take the memory consistency model into account to create correct software:&lt;br /&gt;
&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Release_consistency release consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Delta_consistency delta consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]&lt;br /&gt;
* fork consistency&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Serializability serializability]&lt;br /&gt;
* one-copy serializability&lt;br /&gt;
* entry consistency&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in multicache systems. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol for the multicache hardware, so that the time delay of conflicting writes to a memory location is small. [5]&lt;br /&gt;
&lt;br /&gt;
In contrast, in a shared virtual memory on a loosely coupled multiprocessor, which has no physically shared memory and a nontrivial communication cost between processors, conflicts are not likely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers, and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented without any hardware support: Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, modern processors rely on some form of read-modify-write atomic instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
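&lt;br /&gt;
The optimistic approach built on compare-and-swap can be sketched as a retry loop. Python has no user-level compare-and-swap instruction, so the class AtomicInt below is an assumption of this sketch: it emulates the atomicity of the hardware instruction with an internal lock. The names optimistic_increment and worker are likewise hypothetical.&lt;br /&gt;
```python
import threading

class AtomicInt:
    """Software stand-in for a word that supports compare-and-swap."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # emulates the atomicity of the instruction

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Atomically: store `new` only if the current value equals `expected`.
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def optimistic_increment(counter):
    # Optimistic update: read, compute, then commit only if nobody interfered.
    while True:
        seen = counter.load()
        if counter.compare_and_swap(seen, seen + 1):
            return
        # Collision detected: another thread updated first, so retry.


counter = AtomicInt()

def worker():
    for _ in range(1000):
        optimistic_increment(counter)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.load())  # prints: 4000
```
No thread ever holds a lock across its read and its write. A failed compare-and-swap is the collision detection, and the only cost of a collision is one retry.&lt;br /&gt;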
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59983</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59983"/>
		<updated>2012-03-19T16:59:28Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Consistency Models */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Although the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is undergoing continuous change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely presents exciting and challenging problems. The three main areas under study today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all the copies are the same, it needs to flag a cache line to indicate that it is not up to date.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. If P2 then reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a change to a cached value is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as the update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; this is why the invalidation method is preferred.&lt;br /&gt;
&lt;br /&gt;
In order to improve cache coherence performance over the years, several protocols have been proposed.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback of this protocol appears when a single processor wants to read a block and then write to it while no other processor shares that block. After reading the block, a bus transaction places the block into the shared state. The write then occurs and another bus transaction is sent to invalidate the shared copies. This second transaction is useless because no other processor is sharing the block, but the MSI protocol has no way to express this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same in this protocol as in the MSI protocol. The protocol introduces a new state, the exclusive state, which means that the variable is in only this cache and its value matches the value in main memory. As a result, the shared state now indicates that the variable may be contained in more than one cache.&lt;br /&gt;
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least it will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and receives it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single-processor system, code will execute correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. In a shared memory model with multiple processors, however, two threads can access a shared data element, such as a synchronization variable, and the output of the threads changes depending on which thread accesses the shared data element first. When this occurs, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes at a cost in performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, because of sequential program execution, a read of a particular location can be expected to return the value of the last write to it. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations. [4]&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place according to program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
&lt;br /&gt;
A strong consistency model attempting uniprocessor-like consistency could cause a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The various consistency models are listed below. It is the programmer who must take the memory consistency model into account to create correct software:&lt;br /&gt;
&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Release_consistency release consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Delta_consistency delta consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]&lt;br /&gt;
* fork consistency&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Serializability serializability]&lt;br /&gt;
* one-copy serializability&lt;br /&gt;
* entry consistency&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in multicache systems. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol for the multicache hardware, so that the time delay of conflicting writes to a memory location is small. [5]&lt;br /&gt;
&lt;br /&gt;
In contrast, in a shared virtual memory on a loosely coupled multiprocessor, which has no physically shared memory and a nontrivial communication cost between processors, conflicts are not likely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers, and the data structures used, so the ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented without any hardware support: Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, modern processors rely on some form of read-modify-write atomic instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59981</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59981"/>
		<updated>2012-03-19T16:59:13Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Memory Consistency Problem */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers continues to change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely poses exciting and challenging problems. The three main areas under study today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a single-processor (single-core) system, maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all copies are the same, it needs to flag a cache line to indicate that its copy is out of date.&lt;br /&gt;
&lt;br /&gt;
The figure shown here depicts a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. Then, if P2 reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a change to a cached value is propagated to a lower-level cache or main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With invalidation, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory. &lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as an update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; this is why invalidation is the preferred method.&lt;br /&gt;
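&lt;br /&gt;
The traffic difference described above can be sketched with a small Python model. It assumes a simplified scenario that is not taken from the cited source: one core writes the same variable several times in a row, other caches hold a copy at the start, and no other core touches the variable in between. The function name bus_messages is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def bus_messages(policy, writes):
    """Count bus messages for consecutive writes by one core to a variable
    that other caches initially share (simplified model)."""
    messages = 0
    others_have_copy = True
    for _ in range(writes):
        if policy == "update":
            messages += 1              # every write broadcasts the new value
        elif others_have_copy:
            messages += 1              # only the first write broadcasts an invalidation
            others_have_copy = False   # remote copies are now invalid
    return messages


print(bus_messages("update", 5))      # 5
print(bus_messages("invalidate", 5))  # 1
```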
&lt;br /&gt;
Several protocols have been proposed over the years to improve cache coherence performance.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback of this protocol appears when a single processor wants to read a block and then write to it while no other processor shares that block. Reading the block requires a bus transaction that places the block in the shared state. The write then requires another bus transaction to invalidate any shared copies. This second transaction is useless, as no other processor is sharing the block, but the MSI protocol has no way to express this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same in this protocol as in the MSI protocol. The protocol introduces one new state: the exclusive state. The exclusive state means that the variable is in only this cache and that its value matches the value in main memory. As a result, the shared state now indicates that the variable is contained in more than one cache.&lt;br /&gt;
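&lt;br /&gt;
The benefit of the exclusive state can be sketched by counting bus transactions for the read-then-write pattern that is costly under MSI. This is a minimal Python model under stated assumptions: a single cache starts without the block, and the flag other_sharers says whether any other cache holds a copy. The function name read_then_write is hypothetical and the model ignores data transfer details.&lt;br /&gt;
&lt;br /&gt;
```python
def read_then_write(protocol, other_sharers=False):
    """Return (final_state, bus_transactions) for one cache that reads a
    block it does not hold and then writes to it."""
    transactions = 1                     # the read miss fetches the block over the bus
    if protocol == "MESI" and not other_sharers:
        state = "E"                      # sole clean copy
    else:
        state = "S"                      # MSI cannot tell whether others share the block
    if state == "S":
        transactions += 1                # the write must invalidate possible sharers
    state = "M"                          # the block is now dirty in this cache
    return state, transactions


print(read_then_write("MSI"))                       # ('M', 2)
print(read_then_write("MESI"))                      # ('M', 1)
print(read_then_write("MESI", other_sharers=True))  # ('M', 2)
```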
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &quot;owns&quot; the variable and will provide the current value to other caches when requested (or at least will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and can receive it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single-processor system, code will execute correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. But in a shared memory model with multiple processors, two threads can access a shared data element, such as a synchronization variable, and the output of the threads changes depending on which thread accesses the shared data element first. When this occurs, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes at a cost in performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read, in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, however, the accesses of one processor must appear to all other processors to execute in program order, at least partially. &lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations will occur one at a time in the sequential order specified by the program. Thus, one can expect a read of a particular location to return the value of the last write to that location, because of the sequential program execution. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations if the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations. [4]&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
&lt;br /&gt;
A strong consistency model attempting uniprocessor-like consistency could cause a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The consistency models are listed below; the programmer must take the memory consistency model into account to create correct software:&lt;br /&gt;
&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Release_consistency release consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency ]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Delta_consistency delta consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]&lt;br /&gt;
* fork consistency&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Serializability serializability]&lt;br /&gt;
* one-copy serializability&lt;br /&gt;
* entry consistency&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problems in a shared virtual memory system and in a multicache system are different. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol for the multicache hardware, such that the time delay of conflicting writes to a memory location is small. [5]&lt;br /&gt;
&lt;br /&gt;
In contrast, in a shared virtual memory on a loosely coupled multiprocessor, which has no physically shared memory and a nontrivial communication cost between processors, conflicts are not likely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers and the data structures used. The ''page table'' is therefore an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
The next section discusses whether mutual exclusion can be implemented without requiring any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, most modern processors provide some form of atomic read-modify-write instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
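&lt;br /&gt;
The optimistic approach with collision detection can be sketched as a compare-and-swap retry loop. The Python model below is only an illustration: AtomicCell, optimistic_increment and run_counter are hypothetical names, and an internal threading.Lock stands in for the atomicity that a hardware compare-and-swap instruction would provide.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


class AtomicCell:
    """Models a memory word that supports compare-and-swap (illustrative only)."""

    def __init__(self, value=0):
        self._value = value
        self._atomic = threading.Lock()  # assumption: stands in for hardware atomicity

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Store new only if the word still holds the expected value.
        with self._atomic:
            if self._value == expected:
                self._value = new
                return True
            return False                 # collision: another thread updated the word


def optimistic_increment(cell):
    """Increment without taking a lock first; retry when a collision is detected."""
    retries = 0
    while True:
        seen = cell.load()
        if cell.compare_and_swap(seen, seen + 1):
            return retries
        retries += 1


def run_counter(threads=4, increments=500):
    """Run several incrementing threads and return the final cell value."""
    cell = AtomicCell()

    def worker():
        for _ in range(increments):
            optimistic_increment(cell)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.load()
```
&lt;br /&gt;
No thread ever waits for a lock before starting its update; an update is simply retried if the compare-and-swap reports that the value changed in the meantime, so run_counter(4, 500) should return 2000.&lt;br /&gt;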
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59980</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59980"/>
		<updated>2012-03-19T16:57:18Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* MOSI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers continues to change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely poses exciting and challenging problems. The three main areas under study today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a single-processor (single-core) system, maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all copies are the same, it needs to flag a cache line to indicate that its copy is out of date.&lt;br /&gt;
&lt;br /&gt;
The figure shown here depicts a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. Then, if P2 reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. This is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a change to a cached value is propagated to a lower-level cache or main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
With invalidation, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory. &lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as an update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; this is why invalidation is the preferred method.&lt;br /&gt;
&lt;br /&gt;
Several protocols have been proposed over the years to improve cache coherence performance.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback of this protocol appears when a single processor wants to read a block and then write to it while no other processor shares that block. Reading the block requires a bus transaction that places the block in the shared state. The write then requires another bus transaction to invalidate any shared copies. This second transaction is useless, as no other processor is sharing the block, but the MSI protocol has no way to express this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The modified and invalid states are the same in this protocol as in the MSI protocol. The protocol introduces one new state: the exclusive state. The exclusive state means that the variable is in only this cache and that its value matches the value in main memory. As a result, the shared state now indicates that the variable is contained in more than one cache.&lt;br /&gt;
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an owned state. The owned state means that the processor &quot;owns&quot; the variable and will provide the current value to other caches when requested (or at least will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and can receive it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single-processor system, code will execute correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. But in a shared memory model with multiple processors, two threads can access a shared data element, such as a synchronization variable, and the output of the threads changes depending on which thread accesses the shared data element first. When this occurs, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes at a cost in performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read, in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, however, the accesses of one processor must appear to all other processors to execute in program order, at least partially. &lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations will occur one at a time in the sequential order specified by the program. Thus, one can expect a read of a particular location to return the value of the last write to that location, because of the sequential program execution. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations if the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations. [4]&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
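&lt;br /&gt;
Taken together, these two expectations describe sequential consistency. As a small illustration, the Python sketch below enumerates every interleaving of two hypothetical two-instruction programs that respects program order and treats each access as atomic. The programs P1 and P2, the register names r1 and r2, and the function sc_outcomes are assumptions made for this example only.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import permutations

# P1: x = 1; r1 = y        P2: y = 1; r2 = x        (x and y start at 0)
P1 = [("store", "x"), ("load", "y", "r1")]
P2 = [("store", "y"), ("load", "x", "r2")]


def sc_outcomes():
    """Return every (r1, r2) pair reachable when program order is kept
    and each memory access is atomic."""
    outcomes = set()
    # Each schedule says which program issues the next access (0 = P1, 1 = P2).
    for schedule in set(permutations([0, 0, 1, 1])):
        memory = {"x": 0, "y": 0}
        registers = {}
        pending = [list(P1), list(P2)]
        for program in schedule:
            op = pending[program].pop(0)   # next access in program order
            if op[0] == "store":
                memory[op[1]] = 1
            else:
                registers[op[2]] = memory[op[1]]
        outcomes.add((registers["r1"], registers["r2"]))
    return outcomes


print(sorted(sc_outcomes()))  # [(0, 1), (1, 0), (1, 1)]
```
&lt;br /&gt;
The outcome (0, 0) never appears in this model. A system that lets a load bypass an earlier store to a different location could produce it, which is why such relaxations must be visible in the consistency model.&lt;br /&gt;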
&lt;br /&gt;
A strong consistency model attempting uniprocessor-like consistency could cause a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The consistency models are listed below; the programmer must take the memory consistency model into account to create correct software:&lt;br /&gt;
&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Release_consistency release consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency ]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Delta_consistency delta consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]&lt;br /&gt;
* fork consistency&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Serializability serializability]&lt;br /&gt;
* one-copy serializability&lt;br /&gt;
* entry consistency&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problems in a shared virtual memory system and in a multicache system are different. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol for the multicache hardware, such that the time delay of conflicting writes to a memory location is small. [5]&lt;br /&gt;
&lt;br /&gt;
In contrast, in a shared virtual memory on a loosely coupled multiprocessor, which has no physically shared memory and a nontrivial communication cost between processors, conflicts are not likely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers and the data structures used. The ''page table'' is therefore an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
The next section discusses whether mutual exclusion can be implemented without requiring any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, most modern processors provide some form of atomic read-modify-write instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59978</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59978"/>
		<updated>2012-03-19T16:56:52Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* MESI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is continuously changing. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely raises exciting and challenging problems. The three main areas being studied today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all copies are the same, it needs to flag the affected cache line to indicate that it is out of date.&lt;br /&gt;
&lt;br /&gt;
The figure shown here depicts a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. Then, if P2 reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches; making a write by one processor visible to the other caches is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a change to a cached value is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as an update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered (until another cache reads the variable again); this is why the invalidation method is the preferred method.&lt;br /&gt;
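&lt;br /&gt;
The traffic difference between the two methods can be illustrated with a very small counting model. This is only a sketch under simplifying assumptions that are not part of the original text: one core performs a run of consecutive writes, a fixed number of other caches initially hold a copy, and none of them reads the variable during the run. The function name and parameters are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def bus_messages(policy, writes, sharers):
    """Count bus messages for `writes` consecutive writes by a single core.

    Simplified model (an assumption): `sharers` other caches hold a copy
    before the first write, and none of them reads the variable in between.
    """
    if policy == "update":
        # Every write broadcasts the new value while other copies exist.
        return writes if sharers else 0
    if policy == "invalidate":
        # Only the first write broadcasts; afterwards no other valid copy exists.
        return 1 if writes and sharers else 0
    raise ValueError("unknown policy: " + policy)


print(bus_messages("update", 10, 3))      # 10
print(bus_messages("invalidate", 10, 3))  # 1
```
&lt;br /&gt;
In this model the update method costs one message per write, while the invalidate method costs a single message for the whole run of writes.&lt;br /&gt;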
&lt;br /&gt;
Several protocols have been proposed over the years to improve cache coherence performance.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback to this protocol occurs when a single processor wants to read a block and then write to it while no other processor is sharing that block. After reading the block, a bus transaction places the block into a shared state. The write then occurs and another bus transaction is sent to invalidate the shared copy. This second transaction is useless as no other processors are sharing the block, but the MSI protocol has no way to specify this.&lt;br /&gt;
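&lt;br /&gt;
The read-then-write drawback can be traced with a small state table. This is a minimal sketch that models only the processor-side transitions of a single cache line; the bus transaction names (BusRd, BusRdX, BusUpgr) are conventional labels assumed here for illustration, and the table and function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# (current state, processor event) maps to (next state, bus transaction or None)
MSI = {
    ("I", "read"):  ("S", "BusRd"),    # read miss: fetch the block
    ("I", "write"): ("M", "BusRdX"),   # write miss: fetch and invalidate others
    ("S", "read"):  ("S", None),       # read hit
    ("S", "write"): ("M", "BusUpgr"),  # must invalidate possible other copies
    ("M", "read"):  ("M", None),       # read hit
    ("M", "write"): ("M", None),       # write hit
}


def run(events, state="I"):
    """Apply processor events to one line; return final state and bus traffic."""
    bus = []
    for event in events:
        state, transaction = MSI[(state, event)]
        if transaction is not None:
            bus.append(transaction)
    return state, bus


print(run(["read", "write"]))  # ('M', ['BusRd', 'BusUpgr'])
```
&lt;br /&gt;
Both steps put a transaction on the bus, even when no other cache holds the block.&lt;br /&gt;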
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The Modified and Invalid states are the same in this protocol as in the MSI protocol. The protocol introduces one new state, Exclusive, which means that the variable is held only in this cache and its value matches the value in main memory. As a result, the Shared state now indicates that the variable is contained in more than one cache.&lt;br /&gt;
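&lt;br /&gt;
The effect of the Exclusive state on bus traffic can be sketched with a transition table for one cache line. The event names "read_alone" and "read_shared" are hypothetical: they stand for a read miss when no other cache, or some other cache, holds the line. The bus transaction names are likewise assumed labels, and only processor-side transitions are modeled.&lt;br /&gt;
&lt;br /&gt;
```python
# (current state, processor event) maps to (next state, bus transaction or None)
MESI = {
    ("I", "read_alone"):  ("E", "BusRd"),   # no other cache has the line
    ("I", "read_shared"): ("S", "BusRd"),   # another cache has the line
    ("I", "write"):       ("M", "BusRdX"),
    ("E", "read"):        ("E", None),
    ("E", "write"):       ("M", None),      # silent upgrade, no bus traffic
    ("S", "read"):        ("S", None),
    ("S", "write"):       ("M", "BusUpgr"),
    ("M", "read"):        ("M", None),
    ("M", "write"):       ("M", None),
}


def bus_transactions(events, state="I"):
    """Return the final state and the number of bus transactions issued."""
    count = 0
    for event in events:
        state, transaction = MESI[(state, event)]
        if transaction is not None:
            count += 1
    return state, count


print(bus_transactions(["read_alone", "write"]))   # ('M', 1)
print(bus_transactions(["read_shared", "write"]))  # ('M', 2)
```
&lt;br /&gt;
When the line is not shared, the write after the read needs no bus transaction; the invalidation is only sent when another cache actually holds a copy.&lt;br /&gt;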
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an Owned state. The Owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and can receive it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single-processor system, code executes correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. In a shared memory model with multiple processors, however, two threads can access a shared data element, such as a synchronization variable, and the output of the threads changes depending on which thread accesses the shared data element first. When this occurs, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes at the cost of performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, a read of a particular location can be expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations, such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations. [4]&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place according to the program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
&lt;br /&gt;
A strong consistency model attempting uniprocessor-like consistency could cause a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The various consistency models are listed below; the programmer must take the memory consistency model into account to create correct software:&lt;br /&gt;
&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Release_consistency release consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Delta_consistency delta consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]&lt;br /&gt;
* fork consistency&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Serializability serializability]&lt;br /&gt;
* one-copy serializability&lt;br /&gt;
* entry consistency&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in multicache systems. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory enable the use of a sophisticated coherence protocol for the multicache hardware, such that the time delay of conflicting writes to a memory location is small. [5]&lt;br /&gt;
&lt;br /&gt;
In contrast, in a shared virtual memory on a loosely coupled multiprocessor, which has no physically shared memory and a nontrivial communication cost between processors, conflicts are not likely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers and the data structures used. So ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented purely in software, without requiring any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
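&lt;br /&gt;
Peterson's algorithm for two threads can be sketched as follows. This is an illustrative sketch, not a production lock: it assumes that the reads and writes of the shared variables take effect in program order, as they do under the global interpreter lock of standard CPython. On hardware with a relaxed memory consistency model, memory fences would be required. The variable and function names are chosen here for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
import time

flag = [False, False]  # flag[i] is True while thread i wants to enter
turn = 0               # which thread has to wait if both want to enter
counter = 0            # shared data protected by the algorithm


def worker(me, iterations):
    global turn, counter
    other = 1 - me
    for _ in range(iterations):
        flag[me] = True        # entry section: announce intent
        turn = other           # give priority to the other thread
        while flag[other] and turn == other:
            time.sleep(0)      # busy-wait, letting the other thread run
        value = counter        # critical section: unprotected read-modify-write
        counter = value + 1
        flag[me] = False       # exit section


threads = [threading.Thread(target=worker, args=(i, 500)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)
```
&lt;br /&gt;
If mutual exclusion holds, no increment is lost and the final counter equals the total number of iterations of both threads.&lt;br /&gt;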
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks such that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update, so that the update can be retried if another thread interfered. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, virtually every modern processor provides some form of read-modify-write atomic instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
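&lt;br /&gt;
The optimistic pattern built on compare-and-swap can be sketched as a retry loop. Python has no compare-and-swap instruction, so the hypothetical AtomicInt class below emulates its atomicity with an internal lock; on real hardware this step would be a single atomic instruction. All names are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


class AtomicInt:
    """Software stand-in for a memory word updated by compare-and-swap."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # emulates the atomicity of the instruction

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the word still holds `expected`; report success."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def increment(cell):
    """Optimistic update: read, compute, and retry if a collision is detected."""
    while True:
        seen = cell.load()
        if cell.compare_and_swap(seen, seen + 1):
            return


cell = AtomicInt()


def work():
    for _ in range(1000):
        increment(cell)


threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cell.load())  # 4000
```
&lt;br /&gt;
A failed compare-and-swap is the collision detection step: the thread notices that another thread changed the value in the meantime and simply retries with the fresh value.&lt;br /&gt;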
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59977</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59977"/>
		<updated>2012-03-19T16:55:11Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* MSI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is continuously changing. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely raises exciting and challenging problems. The three main areas being studied today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency, and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the protocol needs to ensure that the data is the same in all caches. If it cannot ensure that all copies are the same, it needs to flag the affected cache line to indicate that it is out of date.&lt;br /&gt;
&lt;br /&gt;
The figure shown here depicts a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. Then, if P2 reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches; making a write by one processor visible to the other caches is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The cache write policy only controls how a change to a cached value is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal and mark their copies invalid; if one of their cores then attempts to access that variable, the access results in a cache miss and the variable is read from main memory.&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as an update signal is sent onto the bus every time the variable is updated. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered (until another cache reads the variable again); this is why the invalidation method is the preferred method.&lt;br /&gt;
&lt;br /&gt;
Several protocols have been proposed over the years to improve cache coherence performance.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback to this protocol occurs when a single processor wants to read a block and then write to it while no other processor is sharing that block. After reading the block, a bus transaction places the block into a shared state. The write then occurs and another bus transaction is sent to invalidate the shared copy. This second transaction is useless as no other processors are sharing the block, but the MSI protocol has no way to specify this.&lt;br /&gt;
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The Modified and Invalid states are the same in this protocol as in the MSI protocol. The protocol introduces one new state, Exclusive, which means that the variable is held only in this cache and its value matches the value in main memory. As a result, the Shared state now indicates that the variable is contained in more than one cache.&lt;br /&gt;
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an Owned state. The Owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and can receive it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single-processor system, code executes correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. In a shared memory model with multiple processors, however, two threads can access a shared data element, such as a synchronization variable, and the output of the threads changes depending on which thread accesses the shared data element first. When this occurs, the program output may not be the value expected. Maintaining program order is very important for memory consistency, but it comes at the cost of performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations occur one at a time in the sequential order specified by the program. Thus, a read of a particular location can be expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations, such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations. [4]&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place according to the program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
&lt;br /&gt;
A strong consistency model attempting uniprocessor-like consistency could cause a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax latency issues with near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
The various consistency models are listed below; the programmer must take the memory consistency model into account to create correct software:&lt;br /&gt;
&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Release_consistency release consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Delta_consistency delta consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]&lt;br /&gt;
* fork consistency&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Serializability serializability]&lt;br /&gt;
* one-copy serializability&lt;br /&gt;
* entry consistency&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in multicache systems. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory enable the use of a sophisticated coherence protocol for the multicache hardware, such that the time delay of conflicting writes to a memory location is small. [5]&lt;br /&gt;
&lt;br /&gt;
In contrast, in a shared virtual memory on a loosely coupled multiprocessor, which has no physically shared memory and a nontrivial communication cost between processors, conflicts are not likely to be resolved with negligible delay, and they much more closely resemble a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers and the data structures used. So ''page table'' becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented purely in software, without requiring any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.&lt;br /&gt;
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks such that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update, so that the update can be retried if another thread interfered. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, virtually every modern processor provides some form of read-modify-write atomic instruction, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59563</id>
		<title>CSC 456 Spring 2012/ch7 MN</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC_456_Spring_2012/ch7_MN&amp;diff=59563"/>
		<updated>2012-03-14T18:01:48Z</updated>

		<summary type="html">&lt;p&gt;Nknichol: /* Cache Coherence Problem */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction==&lt;br /&gt;
Though the migration from [http://en.wikipedia.org/wiki/Uniprocessor_system uniprocessor] to [http://en.wikipedia.org/wiki/Multiprocessing multiprocessing] systems is not new, the world of parallel computers is continuously changing. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and are present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely poses exciting and challenging problems. The three main areas being studied today are [http://en.wikipedia.org/wiki/Cache_coherence cache coherence], memory consistency and [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronization].&lt;br /&gt;
&lt;br /&gt;
=Cache Coherence Problem=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:sharedmem.jpg|thumbnail|right|600px|Shared Memory system with dedicated Cache for each processor&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]	&lt;br /&gt;
&lt;br /&gt;
In a system with a single processor (single core), maintaining cache coherence is simple, but in a multiprocessor system it is much more complicated. Data can be present in any processor's cache, and the coherence protocol needs to ensure that the data is the same in all caches. If it cannot keep all copies identical, it must flag the affected cache line as out of date.&lt;br /&gt;
&lt;br /&gt;
The figure shows a 4-processor shared memory system in which each processor has its own cache. Suppose processor P1 reads memory location M1 and stores it in its local cache. Then, if P2 reads the same memory location, M1 is also stored in P2’s cache. Now, if P1 changes the value of M1, the two copies of the same data, residing in different caches, become different. When P2 operates on M1, it uses the stale value of M1 that was stored in its cache. It is the responsibility of the cache coherence protocol to prevent this. Hardware support is needed to provide a coherent view of data in multiple caches. Propagating a write made in one cache to the other caches in this way is known as the write propagation requirement.&lt;br /&gt;
&lt;br /&gt;
One may think that the cache write policy can provide cache coherence, but this is incorrect. The write policy (write-through or write-back) only controls how a change to a cached value is propagated to a lower-level cache or to main memory. It is not responsible for propagating changes to other caches.&lt;br /&gt;
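&lt;br /&gt;
The P1/P2 scenario above can be reproduced with a toy model. The sketch below assumes private caches with no coherence protocol at all; the class and function names and the values 10 and 99 are hypothetical and chosen only for illustration.&lt;br /&gt;
```python
class PrivateCache:
    """A per-processor cache with no coherence protocol."""

    def __init__(self, memory, write_through=False):
        self.memory = memory              # shared main memory (a dict)
        self.write_through = write_through
        self.lines = {}                   # locally cached address-to-value map

    def read(self, addr):
        if addr not in self.lines:        # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]           # hit: use the local copy

    def write(self, addr, value):
        self.lines[addr] = value          # always update the local copy
        if self.write_through:            # write policy only affects memory
            self.memory[addr] = value


def stale_read_demo(write_through):
    memory = {"M1": 10}
    p1 = PrivateCache(memory, write_through)
    p2 = PrivateCache(memory, write_through)
    p1.read("M1")                         # P1 caches M1
    p2.read("M1")                         # P2 caches M1
    p1.write("M1", 99)                    # P1 changes M1
    # Values seen by P1, by P2, and held in main memory afterwards.
    return p1.read("M1"), p2.read("M1"), memory["M1"]
```
With a write-back policy the call returns (99, 10, 10), and with write-through it returns (99, 10, 99). In both cases P2 still reads the stale value 10, because nothing told P2's cache that its copy is out of date.&lt;br /&gt;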
&lt;br /&gt;
==Cache Coherence Protocols==&lt;br /&gt;
&lt;br /&gt;
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are '''update''' and '''invalidate'''. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable.&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In the invalidate method, an invalidation message is sent onto the inter-core bus when a variable is changed. Every other cache that holds a copy of the variable reads this message and marks its copy invalid. If its core later attempts to access that variable, the access results in a cache miss and the current value is read from main memory (or from another cache).&lt;br /&gt;
&lt;br /&gt;
The update method results in a significant amount of traffic on the inter-core bus, as an update message is sent onto the bus every time the variable is written. The invalidate method only requires an invalidation message the first time a variable is altered; later writes by the same core need no bus traffic until another core reads the variable again. This is why invalidation is the preferred method.&lt;br /&gt;
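&lt;br /&gt;
The difference in bus traffic can be estimated with a deliberately simplified counting model. The sketch assumes a single shared variable that every core initially caches, and it counts a fetch and a broadcast as one bus message each; the function name and trace format are hypothetical.&lt;br /&gt;
```python
def count_bus_messages(trace, cores, policy):
    """Count bus messages for a trace of (op, core) pairs on one variable.

    op is "r" or "w"; policy is "update" or "invalidate".
    """
    if policy not in ("update", "invalidate"):
        raise ValueError(policy)
    valid = set(range(cores))       # assumption: every core starts with a copy
    messages = 0
    for op, core in trace:
        if core not in valid:       # miss: fetch the current value first
            messages += 1
            valid.add(core)
        if op == "w" and valid - {core}:
            messages += 1           # broadcast an update or an invalidation
            if policy == "invalidate":
                valid = {core}      # all other copies become invalid
    return messages


# Core 0 writes the variable five times, then core 1 reads it once.
example_trace = [("w", 0)] * 5 + [("r", 1)]
update_cost = count_bus_messages(example_trace, 2, "update")
invalidate_cost = count_bus_messages(example_trace, 2, "invalidate")
```
For this trace the update policy needs five messages (one per write), while the invalidate policy needs two: one invalidation on the first write and one fetch when core 1 reads again.&lt;br /&gt;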
&lt;br /&gt;
In order to improve cache coherence performance over the years, several protocols have been proposed.&lt;br /&gt;
&lt;br /&gt;
===MSI===&lt;br /&gt;
&lt;br /&gt;
MSI stands for Modified, Shared, and Invalid, the three states that a cache line can be in. The Modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The Shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to main memory. The Invalid state means that the line holds no usable value, either because it was never loaded or because another cache has modified the variable; the cache must read a new value from main memory (or another cache).&lt;br /&gt;
&lt;br /&gt;
A drawback of this protocol appears when a single processor wants to read a block and then write to it while no other processor shares that block. The read causes a bus transaction that places the block in the Shared state. When the write occurs, another bus transaction is sent to invalidate any shared copies. This second transaction is useless as no other processors are sharing the block, but the MSI protocol has no way to know this.&lt;br /&gt;
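&lt;br /&gt;
The three states can be traced with a small simulation of processor-side accesses. This is a sketch under assumptions: it tracks one block only, and the bus transaction names BusRd, BusRdX and BusUpgr are common textbook labels rather than anything defined in this article.&lt;br /&gt;
```python
def msi_step(states, core, op):
    """Apply one access to the per-cache MSI states of a single block.

    states is a list such as ["I", "S"]; it is modified in place.
    Returns the name of the bus transaction, or None for a silent hit.
    """
    mine = states[core]
    if op == "read":
        if mine != "I":
            return None                      # hit in S or M
        for other in range(len(states)):
            if other != core and states[other] == "M":
                states[other] = "S"          # owner flushes and downgrades
        states[core] = "S"
        return "BusRd"
    if op == "write":
        if mine == "M":
            return None                      # already exclusive and dirty
        for other in range(len(states)):
            if other != core:
                states[other] = "I"          # invalidate all other copies
        states[core] = "M"
        return "BusRdX" if mine == "I" else "BusUpgr"
    raise ValueError(op)


def run_trace(accesses, caches=2):
    states = ["I"] * caches
    bus = [msi_step(states, core, op) for core, op in accesses]
    return states, bus


# P0 reads, P1 reads, P0 writes (invalidating P1), P1 reads again.
final_states, bus_log = run_trace(
    [(0, "read"), (1, "read"), (0, "write"), (1, "read")]
)
```
Running the single-processor case run_trace([(0, "read"), (0, "write")], caches=1) yields two bus transactions, BusRd followed by BusUpgr, which is exactly the wasted second transaction described above.&lt;br /&gt;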
&lt;br /&gt;
===MESI===&lt;br /&gt;
&lt;br /&gt;
MESI stands for Modified, Exclusive, Shared, and Invalid. The Modified and Invalid states are the same in this protocol as in the MSI protocol. This protocol introduces a new state: the Exclusive state. The Exclusive state means that the variable is in only this cache and that its value matches the value in main memory, so a later write can move the line to Modified without any bus transaction. The Shared state consequently indicates that the variable is contained in more than one cache.&lt;br /&gt;
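&lt;br /&gt;
The benefit of the Exclusive state shows up when counting bus transactions for a read followed by a write. The helper below is a minimal sketch; the function name, its parameters and the transaction labels BusRd and BusUpgr are assumptions made for this example.&lt;br /&gt;
```python
def read_then_write(protocol, other_sharers=False):
    """Bus transactions when one core reads, then writes, a block it lacks."""
    transactions = ["BusRd"]                 # the read miss costs one transaction
    if protocol == "MSI":
        state = "S"                          # MSI cannot record exclusivity
    elif protocol == "MESI":
        state = "S" if other_sharers else "E"
    else:
        raise ValueError(protocol)
    if state == "S":
        transactions.append("BusUpgr")       # must announce the write on the bus
    # In state E the line moves to M silently, with no bus transaction.
    return transactions


msi_cost = len(read_then_write("MSI"))
mesi_cost = len(read_then_write("MESI"))
mesi_shared_cost = len(read_then_write("MESI", other_sharers=True))
```
Under these assumptions MSI needs two transactions and MESI needs one when no other cache holds the block; when the block really is shared, MESI behaves like MSI.&lt;br /&gt;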
&lt;br /&gt;
===MOSI===&lt;br /&gt;
&lt;br /&gt;
The MOSI protocol is identical to the MSI protocol except that it adds an Owned state. The Owned state means that the processor &amp;quot;owns&amp;quot; the variable and will provide the current value to other caches when requested (or at least will decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory and instead receives it from the owning cache much faster.&lt;br /&gt;
&lt;br /&gt;
===MOESI===&lt;br /&gt;
&lt;br /&gt;
The MOESI protocol is a combination of the MESI and MOSI protocols.&lt;br /&gt;
&lt;br /&gt;
=Memory Consistency Problem=&lt;br /&gt;
&lt;br /&gt;
Memory consistency deals with the ordering of memory operations (loads and stores) to different memory locations. In a single processor system, code will execute correctly if the compiler preserves the order of accesses to synchronization variables and other dependent variables. In a shared memory model with multiple processors, however, two threads can access a shared data element, such as a synchronization variable, and the output of the threads changes depending on which thread accesses the shared data element first. If the accesses occur in an order the programmer did not anticipate, the program output may not be the expected value. Maintaining program order is therefore very important for memory consistency, but it comes at the cost of performance.&lt;br /&gt;
&lt;br /&gt;
== Memory Consistency Models ==&lt;br /&gt;
&lt;br /&gt;
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.&lt;br /&gt;
&lt;br /&gt;
In a single processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, the accesses of one processor must also appear to all other processors to execute in program order, at least partially.&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in Uniprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations will occur one at a time in the sequential order specified by the program. Thus, a read of a particular location can be expected to return the value of the last write to that location. It is sufficient to maintain only uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations as long as the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations. [4]&lt;br /&gt;
&lt;br /&gt;
=== Memory semantics in multiprocessor systems ===&lt;br /&gt;
&lt;br /&gt;
The programmer's implicit expectations are:&lt;br /&gt;
&lt;br /&gt;
* Memory accesses in a processor take place in program order.&lt;br /&gt;
* Each memory access is performed atomically.&lt;br /&gt;
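&lt;br /&gt;
Taken together, these two expectations describe sequential consistency. What they allow can be checked by brute force on a classic two-thread test: thread 0 stores to x and then loads y, while thread 1 stores to y and then loads x. The sketch below enumerates every interleaving that keeps each thread's program order; the variable and register names are hypothetical.&lt;br /&gt;
```python
def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def sequentially_consistent_outcomes():
    # Each operation is (kind, address, value to store or register to fill).
    thread0 = [("store", "x", 1), ("load", "y", "r0")]
    thread1 = [("store", "y", 1), ("load", "x", "r1")]
    outcomes = set()
    for order in interleavings(thread0, thread1):
        memory = {"x": 0, "y": 0}
        registers = {}
        for kind, addr, arg in order:
            if kind == "store":
                memory[addr] = arg             # atomic, globally visible store
            else:
                registers[arg] = memory[addr]  # atomic load
        outcomes.add((registers["r0"], registers["r1"]))
    return outcomes


outcomes = sequentially_consistent_outcomes()
```
The six interleavings produce only the (r0, r1) results (0, 1), (1, 0) and (1, 1). The result (0, 0) never appears when program order and atomicity both hold; a system that lets a load overtake an earlier store, for instance through a write buffer, can produce it.&lt;br /&gt;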
&lt;br /&gt;
A strong consistency model that attempts uniprocessor-like consistency can create a global bottleneck, costing performance. Thus, '''''weak''''' consistency models are deployed to improve performance. The advantages of such models are:&lt;br /&gt;
&lt;br /&gt;
* They support out-of-order execution within individual CPUs&lt;br /&gt;
* They relax the latency issues caused by near-simultaneous accesses by different CPUs&lt;br /&gt;
&lt;br /&gt;
Various consistency models are listed below. It is the programmer who must take the memory consistency model into account to create correct software:&lt;br /&gt;
&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Linearizability linearizability (also known as strict or atomic consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Causal_consistency causal consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Release_consistency release consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Eventual_consistency eventual consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Delta_consistency delta consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/PRAM_consistency PRAM consistency (also known as FIFO consistency)]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Weak_consistency weak consistency]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Vector-field_consistency vector-field consistency]&lt;br /&gt;
* fork consistency&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Serializability serializability]&lt;br /&gt;
* one-copy serializability&lt;br /&gt;
* entry consistency&lt;br /&gt;
&lt;br /&gt;
== Memory Coherence and Shared Virtual Memory ==&lt;br /&gt;
&lt;br /&gt;
The memory coherence problem in a shared virtual memory system differs from that in a multicache system. In a multicache multiprocessor, the processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it possible to use a sophisticated coherence protocol in the multicache hardware, such that the time delay of conflicting writes to a memory location is small. [5]&lt;br /&gt;
&lt;br /&gt;
In contrast, in a shared virtual memory on a loosely coupled multiprocessor, which has no physically shared memory and a nontrivial communication cost between processors, conflicts are not likely to be resolved with negligible delay; they resemble much more a “page fault” in a traditional virtual memory system. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.&lt;br /&gt;
&lt;br /&gt;
Memory coherence strategies are classified based on how they deal with '''''page synchronization''''' and '''''page ownership'''''. The algorithms for memory coherence depend on the page fault handlers, their servers and the data structures used. The ''page table'' therefore becomes an important part of these protocols.&lt;br /&gt;
&lt;br /&gt;
= Synchronization =&lt;br /&gt;
&lt;br /&gt;
Two related concepts dealing with synchronization are '''process synchronization''' and '''data synchronization'''. Process synchronization is concerned with different processes committing to a certain sequence of actions. Data synchronization deals with maintaining data integrity across various copies of a dataset. Process synchronization primitives can be used to implement data synchronization. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Mutual_exclusion Mutual exclusion] is the main requirement to be fulfilled in order to synchronize processes, and is needed in both single-processor and multiprocessor systems. There are various approaches to provide mutual exclusion in a system:&lt;br /&gt;
&lt;br /&gt;
* Disabling interrupts&lt;br /&gt;
* Locks&lt;br /&gt;
* Mutex&lt;br /&gt;
* Semaphores&lt;br /&gt;
* Barriers&lt;br /&gt;
* Test and Set&lt;br /&gt;
&lt;br /&gt;
Mutual exclusion can also be implemented purely in software, without any special hardware support. Peterson's algorithm is one such software solution, guaranteeing mutual exclusion between two processes.&lt;br /&gt;
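&lt;br /&gt;
A minimal sketch of Peterson's algorithm for two threads follows. It assumes that the shared flags are read and written in program order, which holds for standard CPython threads but not necessarily for every runtime or for hardware with a weak memory model; the function and variable names are chosen for this example.&lt;br /&gt;
```python
import threading
import time


def run_peterson(rounds=200):
    flag = [False, False]   # flag[i] is True while thread i wants to enter
    turn = [0]              # index of the thread that must wait on a tie
    counter = [0]           # shared data protected by the critical section

    def worker(me):
        other = 1 - me
        for _ in range(rounds):
            flag[me] = True             # entry protocol: announce intent
            turn[0] = other             # and let the other thread go first
            while flag[other] and turn[0] == other:
                time.sleep(0)           # busy-wait, yielding the interpreter
            value = counter[0]          # critical section: read the counter,
            time.sleep(0)               # invite a thread switch,
            counter[0] = value + 1      # then write the incremented value
            flag[me] = False            # exit protocol

    pool = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter[0]
```
If mutual exclusion holds, no increment is lost and run_peterson(200) returns 400. Only ordinary loads and stores are used, with no lock or atomic instruction.&lt;br /&gt;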
&lt;br /&gt;
== Hardware support ==&lt;br /&gt;
&lt;br /&gt;
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a ''pessimistic'' approach. In contrast, the ''optimistic'' approach proceeds with an update, hoping that it can be completed without any interference. This requires ''collision detection'' during the update, with a retry when a collision is found. The optimistic approach is thus more efficient for fine-grained operations.&lt;br /&gt;
&lt;br /&gt;
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like [http://en.wikipedia.org/wiki/Test_and_Test-and-set test-and-set], [http://en.wikipedia.org/wiki/Fetch-and-add fetch-and-increment] and [http://en.wikipedia.org/wiki/Swap_%28computer_science%29 swap] were sufficient for early processors to implement mutexes for concurrent objects. Today, most modern processors provide a more general atomic read-modify-write primitive, such as [http://en.wikipedia.org/wiki/Compare-and-swap compare-and-swap] or [http://en.wikipedia.org/wiki/Load-link/store-conditional LL/SC], for the same purpose.&lt;br /&gt;
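&lt;br /&gt;
How a mutex can be built on test-and-set is sketched below. Python exposes no such instruction, so the hypothetical &#39;&#39;TasWord&#39;&#39; class models its atomicity with an internal lock; on real hardware the whole test_and_set method would be a single instruction. All names are assumptions made for this example.&lt;br /&gt;
```python
import threading
import time


class TasWord:
    """Software stand-in for a memory word with an atomic test-and-set."""

    def __init__(self):
        self._flag = False
        self._guard = threading.Lock()  # models the atomicity hardware provides

    def test_and_set(self):
        """Set the flag and return the value it held before."""
        with self._guard:
            old = self._flag
            self._flag = True
            return old

    def clear(self):
        self._flag = False


class SpinLock:
    """Mutex built only from test-and-set and an ordinary store."""

    def __init__(self):
        self._word = TasWord()

    def acquire(self):
        while self._word.test_and_set():  # True means someone else holds it
            time.sleep(0)                 # spin, yielding the interpreter

    def release(self):
        self._word.clear()


def count_with_spinlock(threads=4, per_thread=300):
    lock = SpinLock()
    total = [0]

    def worker():
        for _ in range(per_thread):
            lock.acquire()
            total[0] += 1                 # critical section
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total[0]
```
Only the thread that sees the old value False has acquired the lock; every other caller sees True and keeps spinning until the holder clears the flag.&lt;br /&gt;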
&lt;br /&gt;
=Resources=&lt;br /&gt;
&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_jp#Memory_Consistency_Problem &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2011/ch7_ss#Cache_Coherence &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt;http://web.sfc.keio.ac.jp/~rdv/keio/sfc/teaching/architecture/architecture-2007/lec08.html&lt;/div&gt;</summary>
		<author><name>Nknichol</name></author>
	</entry>
</feed>