<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Shvemuri</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Shvemuri"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Shvemuri"/>
	<updated>2026-06-08T21:30:22Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/12b_sl&amp;diff=74919</id>
		<title>CSC/ECE 506 Spring 2013/12b sl</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/12b_sl&amp;diff=74919"/>
		<updated>2013-04-17T18:44:23Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;On-chip interconnects&lt;br /&gt;
__TOC__ &lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The current trend in microprocessor design has shifted from extracting ever-increasing performance gains from single-core architectures to leveraging the power of multiple cores per die.  This creates new challenges not present in single-core systems.  A multi-core processor must have a method of passing information between processing cores that is efficient in terms of power consumed, space used on die, and the speed at which messages are delivered.  As physical wire widths decrease and the number of wires increases, the gap between gate delay and wire delay widens.[[#References|[14]]]  To address these challenges, much research has been done in the area of on-chip networks.&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
On-chip interconnects are a natural extension of the high integration levels now reached in multiprocessor design. Moore's law predicted that the number of transistors in an integrated circuit doubles approximately every two years. This prediction has driven the integration of on-chip components and continues to guide the semiconductor industry.&lt;br /&gt;
[[File:Itr MIC image 920x460.png|thumb|c|right|Intel® MIC]]&lt;br /&gt;
In recent years, the main players in the chip industry have been racing to integrate more cores in a chip, with multi-core (more than one core) and many-core (so many cores that the historical multi-core techniques are no longer efficient) technologies. This integration is known as [http://en.wikipedia.org/wiki/Multi-core_(computing) CMP] (chip multiprocessor), and more recently Intel has coined the term [http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html Intel® Many Integrated Core (Intel® MIC)].&lt;br /&gt;
&lt;br /&gt;
Traditional off-chip networks have proved to be of limited use for communication among these many cores inside a single chip. According to [[#References|[2]]], off-chip designs suffer from I/O bottlenecks, which are less of a problem for on-chip technologies because the internal wiring provides much higher bandwidth and avoids the delay associated with external traffic. Nevertheless, on-chip designs still pose challenges that need further study, among them power consumption and space constraints.&lt;br /&gt;
&lt;br /&gt;
=== Terminology ===&lt;br /&gt;
Some common terms:&lt;br /&gt;
* [http://en.wikipedia.org/wiki/System_on_a_chip SoCs] (Systems-on-a-chip), which commonly refer to chips that are made for a specific application or domain area.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/MPSoC MPSoCs] (Multiprocessor systems-on-chip), referring to a SoC that uses multi-core technology.&lt;br /&gt;
It is interesting to note that, for the particular theme of this article, there are at least three different acronyms referring to the same concept. These are new technologies and different researchers have adopted different nomenclature. The acronyms are:&lt;br /&gt;
* NoC (network-on-chip), the most common term and the one used in this article&lt;br /&gt;
* OCIN (on-chip interconnection network) &lt;br /&gt;
* OCN (on-chip network)&lt;br /&gt;
&lt;br /&gt;
== Topologies ==&lt;br /&gt;
Topology refers to the layout or arrangement of interconnections among the processing elements. In general, a good topology aims to minimize network latency and maximize throughput.&lt;br /&gt;
There are certain metrics that help with the classification and comparison of the different topology types. Some of them are defined in Solihin's [[#References|[3]]] textbook in chapter 12.&lt;br /&gt;
&lt;br /&gt;
*'''Degree''' of a node is the number of nodes that are its neighbors, in other words, the nodes that can be reached from it in one hop&lt;br /&gt;
*'''Hop count''' is the number of nodes that a message needs to pass through to get to the destination&lt;br /&gt;
*'''Diameter''' is the maximum hop count&lt;br /&gt;
*'''Path diversity''' is useful for the routing algorithm and is given by the number of shortest paths that a topology offers between two nodes&lt;br /&gt;
*'''Bisection width''' is the smallest number of wires that have to be cut to separate the network into two halves&lt;br /&gt;
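&lt;br /&gt;
As a minimal sketch of how the metrics above can be evaluated, the following Python example measures them on a k x k 2D mesh. The function names are hypothetical, hops are counted as links traversed, and the bisection width is taken from a straight cut through the middle of the mesh, which is assumed here to be the minimum cut for an even k.&lt;br /&gt;

```python
from collections import deque


def mesh_neighbors(node, k):
    """Neighbors of (x, y) in a k x k 2D mesh without wraparound links."""
    x, y = node
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in candidates if a in range(k) and b in range(k)]


def hop_counts(src, k):
    """Breadth-first search: minimum hop count from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        for nxt in mesh_neighbors(cur, k):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


def diameter(k):
    """Largest minimum hop count over all source/destination pairs."""
    nodes = [(x, y) for x in range(k) for y in range(k)]
    return max(max(hop_counts(n, k).values()) for n in nodes)


def middle_cut_links(k):
    """Links crossing a straight vertical cut through the middle (even k)."""
    half = k // 2
    nodes = [(x, y) for x in range(k) for y in range(k)]
    return sum(1 for n in nodes for m in mesh_neighbors(n, k)
               if n[0] == half - 1 and m[0] == half)


if __name__ == "__main__":
    print("degree of a corner node:", len(mesh_neighbors((0, 0), 4)))
    print("degree of a central node:", len(mesh_neighbors((1, 1), 4)))
    print("diameter of a 4x4 mesh:", diameter(4))
    print("links across the middle cut:", middle_cut_links(4))
```

The difference between the corner and central degrees printed here is the edge asymmetry of meshes discussed in the 2-D Mesh section.&lt;br /&gt;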
&lt;br /&gt;
&lt;br /&gt;
Topologies can be classified as direct and indirect topologies.&lt;br /&gt;
In a direct topology, each node is connected to other nodes, which are called neighbouring nodes. Each node contains a network interface acting as a router in order to transfer information.&lt;br /&gt;
In an indirect topology, some nodes are not computational but act as switches that transfer traffic among the rest of the nodes, including other switches. It is called indirect because packets are switched through specific elements that are not part of the computational nodes themselves.&lt;br /&gt;
&lt;br /&gt;
An example of a direct topology is the 2-D Mesh. An example of an indirect topology is the butterfly, from which the Flattened Butterfly described below is derived.  &lt;br /&gt;
&lt;br /&gt;
Many other topologies could be introduced in this section. Topologies not covered here include, but are not limited to:&lt;br /&gt;
&lt;br /&gt;
* Hypercube&lt;br /&gt;
* Shuffle-exchange&lt;br /&gt;
* Torus&lt;br /&gt;
* Trees&lt;br /&gt;
&lt;br /&gt;
They are cited here only for completeness. Related information can be found at [http://www.cs.cf.ac.uk/Parallel/Year2/section5.html Interconnection Networks]&lt;br /&gt;
&lt;br /&gt;
===Rings===&lt;br /&gt;
[http://en.wikipedia.org/wiki/Ring_network Ring topologies] can be effective when the “number of cores is still relatively small but is larger than what can be supported using a bus” [Solihin 409]. Such cases are considered to use “medium-scale” interconnection networks.&lt;br /&gt;
&lt;br /&gt;
=== 2-D Mesh ===&lt;br /&gt;
[[File:Mesh.png|thumb|c|right|upright=0.75|2D Mesh]]This has been a very popular topology due to its simple design and low layout and router complexity. It is often described as a k-ary n-cube, where k is the number of nodes in each dimension and n is the number of dimensions. For example, a 4-ary 2-cube is a 4x4 2D mesh.&lt;br /&gt;
Another advantage is that this topology is similar to the physical die layout, making it well suited to tiled architectures. For reference, the combination of a switch and a processor is called a ''tile''.&lt;br /&gt;
&lt;br /&gt;
The topology also has drawbacks. One of them is that the degree of the nodes along the edges is lower than the degree of the central nodes. This makes the 2D Mesh asymmetrical along the edges, so from the networking perspective there is less demand for edge channels than for central channels.&lt;br /&gt;
&lt;br /&gt;
Jerger and Peh [[#References|[2]]] provide the following parameters for a mesh defined as a k-ary n-cube:&lt;br /&gt;
*the switch degree for a 2D mesh would be 4, as its network requires two channels in each dimension or 2n, although some ports on the edge will be unused.&lt;br /&gt;
*average minimum hop count: &lt;br /&gt;
:{| {{table}}&lt;br /&gt;
| nk/3|| ||k even&lt;br /&gt;
|-&lt;br /&gt;
| n(k/3 − 1/(3k))|| ||k odd&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
*the channel load across the bisection of a mesh under uniform random traffic with an even k is k/4&lt;br /&gt;
*meshes provide diversity of paths for routing messages&lt;br /&gt;
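&lt;br /&gt;
The parameters quoted above can be evaluated with a short Python sketch. The function names are hypothetical, the odd-k entry of the table is read as n(k/3 − 1/(3k)), and exact fractions are used so that no rounding is introduced.&lt;br /&gt;

```python
from fractions import Fraction


def switch_degree(n):
    """Two channels per dimension, as quoted above: 2n."""
    return 2 * n


def avg_min_hops(k, n):
    """Average minimum hop count of a mesh, per the table quoted above."""
    if k % 2 == 0:
        return Fraction(n * k, 3)
    return n * (Fraction(k, 3) - Fraction(1, 3 * k))


def bisection_channel_load(k):
    """Channel load across the bisection under uniform random traffic."""
    if k % 2 != 0:
        raise ValueError("the quoted k/4 figure is stated for even k only")
    return Fraction(k, 4)


if __name__ == "__main__":
    # A 4-ary 2-cube, i.e. the 4x4 2D mesh used as the example above.
    print("switch degree:", switch_degree(2))
    print("average minimum hop count:", avg_min_hops(4, 2))
    print("bisection channel load:", bisection_channel_load(4))
```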
&lt;br /&gt;
=== Concentration Mesh ===&lt;br /&gt;
[[File:Concentratedmesh.png|thumb|c|right|upright=0.75|Concentration Mesh]] This is an evolution of the mesh topology. There is no real need to have a 1:1 relationship between the number of cores and the number of switches/routers. The Concentration mesh reduces the ratio to 1:4, i.e. each router serves four computing nodes. &lt;br /&gt;
&lt;br /&gt;
The advantage over the simple mesh is the decrease in the average hop count. This is important in terms of scaling the solution. But it is not as scalable as it might seem, as its degree is limited by the crossbar complexity [[#References|[1]]].&lt;br /&gt;
&lt;br /&gt;
The reduction in the ratio introduces a lower bisection channel count, but it can be avoided by introducing express channels, as demonstrated in [[#References|[4]]].&lt;br /&gt;
&lt;br /&gt;
Another drawback is that the port bandwidth can become a bottleneck in periods of high traffic.&lt;br /&gt;
&lt;br /&gt;
=== Flattened Butterfly ===&lt;br /&gt;
[[File:Flbfly.png|thumb|c|right|upright=0.75|Flattened butterfly]]A butterfly topology is often described as a k-ary n-fly, which implies k&amp;lt;sup&amp;gt;n&amp;lt;/sup&amp;gt; network nodes with n stages of k&amp;lt;sup&amp;gt;n−1&amp;lt;/sup&amp;gt; k × k intermediate routing nodes. The degree of each intermediate router is 2k.  &lt;br /&gt;
The flattened butterfly is made by flattening (i.e. combining) the routers in each row of a butterfly topology while preserving the inter-router connections. It can use non-minimal routing to improve load balancing in the network.&lt;br /&gt;
Some advantages are that the maximum distance between nodes is two hops and that it has lower latency and better throughput than the mesh topology.&lt;br /&gt;
Its disadvantages are a high channel count (k&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;/2 per row/column), low channel utilization, and increased control complexity.&lt;br /&gt;
The flattened butterfly offers the benefits of a tree (fewer constraints on root-level bandwidth [Solihin 367]) as well as the ability to actually be mapped to a substrate, but because of node concentration&amp;lt;ref name=&amp;quot;GrotKeckler&amp;quot;/&amp;gt; the number of channels required for high scalability is cost- and validation-prohibitive.&lt;br /&gt;
&lt;br /&gt;
===Crossbar Switch===&lt;br /&gt;
A crossbar switch topology uses a bus arrangement with the bus lines physically perpendicular to each other and whose intersections are connected or disconnected with a switch. In the case of [http://en.wikipedia.org/wiki/Multi-core_(computing) CMPs], this switch is a transistor or, depending on the desired characteristics of the system, a programmable fuse. Due to their ability to be [http://en.wikipedia.org/wiki/Multistage_interconnection_networks multi-staged]&amp;lt;ref name=&amp;quot;wikicrossbarsemi&amp;quot;&amp;gt;&amp;quot;[http://en.wikipedia.org/wiki/Crossbar_switch#Semiconductor Crossbar switch].&amp;quot; Wikipedia. Last accessed April 24, 2012.&amp;lt;/ref&amp;gt;, these topologies lend themselves to being used for memory in large-scale systems. The IBM Cyclops64 architecture is an example of the implementation of this architecture&amp;lt;ref name=&amp;quot;cyclops64&amp;quot;&amp;gt;Zhang, Ying Ping. &amp;quot;[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1639301 A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture]. April 2006. IEEE Xplore.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Multidrop Express Channels (MECS) ===&lt;br /&gt;
[[File:Mecs.png|thumb|c|right|upright=0.75|MECS]] Multidrop Express Channels was proposed in [[#References|[1]]] by Grot and Keckler. Their motivation was that performance and scalability should be obtained by managing wiring. &lt;br /&gt;
Multidrop Express Channels is defined by its authors as a &amp;quot;one-to-many communication fabric that enables a high degree of connectivity in a bandwidth-efficient manner.&amp;quot;  It is based on point-to-point unidirectional links. This makes for a high degree of connectivity with fewer bisection channels and higher bandwidth for each channel. &lt;br /&gt;
&lt;br /&gt;
Some of the parameters calculated for MECS are:&lt;br /&gt;
*Bisection channel count per row/column is equal to k.&lt;br /&gt;
*Network diameter (maximum hop count) is two.&lt;br /&gt;
*The number of nodes accessible through each channel ranges from 1 to k − 1.&lt;br /&gt;
*A node has 1 output port per direction.&lt;br /&gt;
*The input port count is 2(k − 1).&lt;br /&gt;
&lt;br /&gt;
The low channel count and the high degree of connectivity provided by each channel increase per channel bandwidth and wire utilization. At the same time, the design minimizes the serialization delay. It presents low network latencies due to its low diameter.&lt;br /&gt;
&lt;br /&gt;
=== Comparison of topologies ===&lt;br /&gt;
This data is taken from the analysis done in [[#References|[1]]]. &lt;br /&gt;
&lt;br /&gt;
[[File:Topologycomp.png|thumbnail|center|upright=5|Comparison of CMesh, Flattened Butterfly, and MECS]]&lt;br /&gt;
&lt;br /&gt;
The information in this table compares three of the topologies described above for two combinations of k, the network radix (nodes/dimension), and c, the concentration factor (1 being no concentration). &lt;br /&gt;
&lt;br /&gt;
The maximum hop count is 2 for flattened butterfly and MECS, whereas it is directly proportional to k in the case of Concentrated Mesh. This makes flattened butterfly and MECS the better solutions, with lower network latency.&lt;br /&gt;
&lt;br /&gt;
The bisection channel count is 1 for CMesh in both cases, whereas it doubles and even quadruples between MECS and flattened butterfly. &lt;br /&gt;
&lt;br /&gt;
The bandwidth per channel in this example is better for CMesh and MECS, and lower in the case of flattened butterfly.&lt;br /&gt;
&lt;br /&gt;
=== Examples of topologies in current NoCs ===&lt;br /&gt;
&lt;br /&gt;
==== Intel ====&lt;br /&gt;
The [http://techresearch.intel.com/ProjectDetails.aspx?Id=151 Intel Teraflops Research Chip] is built around an 8x10 mesh with two 38-bit unidirectional links per channel. It has a bisection bandwidth of 380 GB/s, which includes data and sideband communication. There is a 5-port router inside each of the computing nodes, and communication is carried out through message passing. Its name comes from its performance of one trillion mathematical calculations per second (1 Teraflops), achieved with 80 simple cores, each containing 2 floating-point units, while consuming only 62 watts (less than many other processors).&lt;br /&gt;
 &lt;br /&gt;
The [http://techresearch.intel.com/ProjectDetails.aspx?Id=1 Single-Chip Cloud Computer] contains a 24-router mesh network with 256 GB/s bisection bandwidth. This design contains 48 fully functional cores and consumes only 25 watts. This newer model is more complete than the Teraflops Research model. It is fully programmable and is used for research by academia and private companies.&lt;br /&gt;
&lt;br /&gt;
==== Tilera ====&lt;br /&gt;
Tilera ([http://www.tilera.com/products/processors Tilera TileGx, TilePro, and Tile64]) is a fabless semiconductor company that has developed a &amp;quot;tile processor&amp;quot; whereby the fabrication of the multi-processor device is greatly simplified by the placement of processor &amp;quot;tiles&amp;quot; on the die. The technology behind this innovation is iMesh, which is the name of the on-chip interconnection technology used in the Tile Processor's architecture&amp;lt;ref name=&amp;quot;Tilera&amp;quot;&amp;gt;&amp;quot;On-Chip Interconnection Architecture of the Tile Processor,&amp;quot; Wentzlaff, et al. 2007. IEEE Xplore.&amp;lt;/ref&amp;gt;.  The iMesh™ consists of five 8x8 independent mesh networks with two 32-bit unidirectional links per channel. The Tile Processor is innovative due to its highly scalable implementation of an on-chip network that utilizes 2D meshes. These are physically organized (as opposed to logically organized) due to design considerations when scaling and laying out new designs. The network provides a bisection bandwidth of 320 GB/s.&lt;br /&gt;
The tiles that make up the Tilera designs each contain a complete processor with L1 and L2 caches. Each tile can run an operating system independently, or several tiles together can run a single operating system such as SMP Linux.&lt;br /&gt;
&lt;br /&gt;
==== ST Microelectronics ====&lt;br /&gt;
[[File:Spidergon.png|thumb|c|right|upright=1.5|Example of Spidergon design]]&lt;br /&gt;
ST Microelectronics created the Spidergon design for the STNoC [[#References|[5]]]. &lt;br /&gt;
&lt;br /&gt;
The Spidergon is a pseudo-regular topology with a design that is composed of three building blocks: network interface, router, and physical link. These building blocks make the design ready to be tailored to the needs of the application. Each router building block has a degree of 3.&lt;br /&gt;
&lt;br /&gt;
The 3 building blocks can be used to create the specific design needed, with the input/output ports that the application requires. The blocks can be configured and stored in a library for creating the design. In the picture on the right, the example contains 2 of the building blocks (router and network interface) and a third undisclosed block.&lt;br /&gt;
&lt;br /&gt;
==== IBM ====&lt;br /&gt;
The IBM [http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.html Cell] project uses an interconnect with four unidirectional 16B-wide data rings, two in each direction. The interconnect is named the Element Interconnect Bus (EIB), and it allows the different components of the Cell to communicate with each other and with external I/O. The total network bisection bandwidth is 307.2 GB/s. &lt;br /&gt;
&lt;br /&gt;
As a curiosity, the Cell processor was jointly developed with Sony and Toshiba, and is [http://en.wikipedia.org/wiki/Cell_(microprocessor) used] in the [http://news.cnet.com/PlayStation-3-chip-has-split-personality/2100-1043_3-5566340.html?tag=nl Sony PlayStation 3]. The Cell consists of a PowerPC core which manages eight synergistic processing elements (SPEs) that can be used for floating-point calculations. These calculations provide the engine for better gaming systems.&lt;br /&gt;
&lt;br /&gt;
====ARM CoreLink Interconnect====&lt;br /&gt;
The ARM CoreLink Interconnect is a highly flexible and configurable interconnection network specification that implements the [http://en.wikipedia.org/wiki/Advanced_Microcontroller_Bus_Architecture AMBA] (Advanced Microcontroller Bus Architecture) protocol. The AMBA protocol is &amp;quot;an open standard, on-chip interconnect specification for the connection and management of functional blocks in a System-on-Chip (SoC). It enables development of multi-processor designs with large numbers of controllers and peripherals.&amp;quot;&amp;lt;ref name=&amp;quot;ambadoc&amp;quot;&amp;gt;[http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.set.amba/index.html AMBA] on the ARM Info Center.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Routing ==&lt;br /&gt;
&lt;br /&gt;
A variety of routing protocols can be used for [http://en.wikipedia.org/wiki/System_on_a_chip SoCs], each having different advantages and disadvantages.  They can be broadly classified in several different ways.&lt;br /&gt;
&lt;br /&gt;
===General Routing Schemes===&lt;br /&gt;
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Store_and_forward Store and forward routing]==== &lt;br /&gt;
This routing scheme has been used since the early days of telecommunications.  It requires that the entire message be received at a node before it is propagated to the next node.  This protocol suffers from a high storage requirement and high latency, due to the need to completely buffer a message before forwarding it.[[#References|[7]]]  This approach can be quite effective when the average packet size is small in comparison with the channel widths.&lt;br /&gt;
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Cut-through_switching Cut-Through routing] or [http://en.wikipedia.org/wiki/Wormhole_switching Worm Hole routing]====&lt;br /&gt;
These two protocols use the switch to examine the flit header, decide where to send the message, and then start forwarding it immediately.  True cut-through routing lets the tail continue when the head is blocked, stacking the message's flits into a single switch (which requires a buffer large enough to hold the largest packet).  In worm hole routing, when the head of the message is blocked the message stays strung out over multiple nodes in the network, potentially blocking other messages (however, this needs only enough buffer space to store the piece of the packet that is sent between switches).  Using a cut-through protocol lowers latency, but it can suffer from packet corruption and must implement a scheme to handle this.[[#References|[7]]]&lt;br /&gt;
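&lt;br /&gt;
The latency difference between store and forward routing and cut-through routing can be illustrated with a common idealized model. This is only a sketch under stated assumptions that do not come from the cited sources: no contention, one flit crossing a link per cycle, and a fixed per-hop routing delay. The function and parameter names are hypothetical.&lt;br /&gt;

```python
def store_and_forward_cycles(hops, packet_flits, router_delay=1):
    """Every hop waits for the whole packet before forwarding it."""
    return hops * (router_delay + packet_flits)


def cut_through_cycles(hops, packet_flits, router_delay=1):
    """Only the header pays the per-hop delay; the body is pipelined behind it."""
    return hops * router_delay + packet_flits


if __name__ == "__main__":
    # Hypothetical example: an 8-flit packet crossing 4 hops.
    print("store and forward:", store_and_forward_cycles(4, 8), "cycles")
    print("cut-through:", cut_through_cycles(4, 8), "cycles")
```

In this model the gap between the two schemes grows with both the hop count and the packet length, and it shrinks when packets are short.&lt;br /&gt;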
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Deterministic_routing Deterministic routing]====&lt;br /&gt;
This describes a routing scheme where, if we are given a pair of nodes, the same path will always be used between those nodes.&lt;br /&gt;
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Adaptive_routing Adaptive routing]====&lt;br /&gt;
This is a routing scheme where the underlying routers may alter the path of packet flow in response to system conditions or other algorithm criteria.  Adaptive routing is intended to provide as many routes as possible to reach the destination.&lt;br /&gt;
&lt;br /&gt;
====Deadlock and Livelock====&lt;br /&gt;
&lt;br /&gt;
Deadlock and livelock are two separate situations that may occur during routing, both resulting in packets never reaching their destination.  They are defined as follows:&lt;br /&gt;
&lt;br /&gt;
''' Deadlock ''' is defined as a situation where there are activities (e.g., messages) each waiting for another to finish something.[[#References|[8]]] Since a waiting activity cannot finish, the messages are deadlocked.  This is analogous to the [http://en.wikipedia.org/wiki/Dining_philosophers_problem Dining Philosophers Problem]: each deadlocked message is waiting on the result of another deadlocked message, and none are able to reach their destination.&lt;br /&gt;
&lt;br /&gt;
''' Livelock ''' is defined as a situation where a message can move from node to node but will never reach its destination node.[[#References|[8]]]  This is similar to deadlock in that the message never reaches its destination, but the message is still able to travel through portions of the network, making hops but never reaching its target.  This is analogous to a process spinning while waiting: the process itself is doing meaningless work but it is still active.  &lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
===Routing Protocols in SoCs===&lt;br /&gt;
&lt;br /&gt;
The specific routing protocols below are built using the ideas from the classes of protocols previously described.&lt;br /&gt;
&lt;br /&gt;
==== Source Routing ====&lt;br /&gt;
&lt;br /&gt;
The source node partially or totally computes the path a packet will take through the network and stores the information in the packet header.  The extra route information is sent in each packet, increasing its size.&lt;br /&gt;
&lt;br /&gt;
==== Distributed Routing ====&lt;br /&gt;
&lt;br /&gt;
Each switch in the network computes the next hop that the packet will take towards the destination.  The packet header contains only the destination information, reducing its size compared to source routing.  This approach requires routing tables to be present to direct the packet from node to node, which does not scale well when the number of nodes increases.&lt;br /&gt;
&lt;br /&gt;
==== Logic Based Distributed Routing (LBDR) ====&lt;br /&gt;
&lt;br /&gt;
In this protocol, routing is achieved by each router knowing its position in the architecture and being able to determine in which direction the destination of the packet lies.  It is most commonly used in 2D meshes, but it can be applied to other topologies as well.[[#References|[7]]]  Using this position information, it is possible to route the packet based on a small number of bits and a few logic gates per router, which is cheaper than a table or a buffer.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are several variations of LBDR:&lt;br /&gt;
&lt;br /&gt;
''' LBDRe ''' - This variation models up to two future hops before deciding where to send the packet next.    &lt;br /&gt;
&lt;br /&gt;
''' uLBDR (Universal LBDR) ''' - This variation adds packet multicast support to the protocol.&lt;br /&gt;
&lt;br /&gt;
''' bLBDR ''' - This variation adds the ability to broadcast messages to only certain regions (segments) of the network.&lt;br /&gt;
&lt;br /&gt;
==== Bufferless Deflection Routing (BLESS protocol) ====&lt;br /&gt;
&lt;br /&gt;
In this protocol, each flit of a packet is routed independently of every other flit through the network, and different flits from the same packet may take different paths.  Any contention between multiple flits results in one flit taking the desired path and the other flit being “deflected” to some other router.  This may result in undesirable routing, but the packets will eventually reach the destination.[[#References|[10]]]  This type of routing is feasible on every network topology that satisfies the following two constraints: Every router has at least the same number of output ports as the number of its input ports, and every router is reachable from every other router.[[#References|[10]]]  &lt;br /&gt;
&lt;br /&gt;
==== CHIPPER (Cheap-Interconnect Partially Permuting Router) ====&lt;br /&gt;
&lt;br /&gt;
This protocol was designed to address inefficient port allocation in the BLESS protocol.  A permutation network directs deflected flits to free output ports.  By limiting the requirements so that only the highest-priority flit obtains its request, we can prevent livelock.  In the case of contention, arbitration logic chooses a winning flit.  It does this by choosing a single packet and prioritizing that packet globally above all other packets for long enough that its delivery is ensured.  Every packet in the system eventually receives this special status, so every packet is eventually delivered (the Golden Packet scheme).[[#References|[11]]]&lt;br /&gt;
&lt;br /&gt;
==== Dimension-order Routing ====&lt;br /&gt;
&lt;br /&gt;
This protocol is a deterministic strategy for multidimensional networks.  Each dimension is chosen in order and routed completely before switching to the next dimension.  For example, in a 2D mesh, dimension-order routing could be implemented by completely routing the packet in the X-dimension before beginning to route in the Y-dimension.  This is extensible to higher-order connections as well. For example, hypercubes can be routed in dimension order by routing packets along the dimensions in the order of the bit positions in which the source and destination addresses differ, one bit position at a time.[[#References|[9]]]&lt;br /&gt;
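&lt;br /&gt;
The two examples in the paragraph above can be sketched in Python as follows. The function names are hypothetical, mesh nodes are assumed to be addressed by (x, y) coordinates, and hypercube nodes by integers whose bits are the per-dimension coordinates.&lt;br /&gt;

```python
def xy_route(src, dst):
    """Nodes visited in a 2D mesh when X is fully routed before Y."""
    path = [tuple(src)]
    for axis in (0, 1):  # 0 is the X dimension, 1 is the Y dimension
        while path[-1][axis] != dst[axis]:
            cur = list(path[-1])
            # Move one hop towards the destination along the current axis.
            cur[axis] += 1 if dst[axis] == max(dst[axis], cur[axis]) else -1
            path.append(tuple(cur))
    return path


def hypercube_route(src, dst, dims):
    """Nodes visited in a hypercube, correcting differing bits lowest first."""
    path = [src]
    for bit in range(dims):
        if (src ^ dst) // 2**bit % 2 == 1:
            path.append(path[-1] ^ 2**bit)
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(hypercube_route(0b000, 0b101, 3))
```

Because the path depends only on the source and destination, the same pair of nodes always uses the same route, which is what makes the scheme deterministic.&lt;br /&gt;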
&lt;br /&gt;
== Lines of Research ==&lt;br /&gt;
From the NoC perspective, there are many lines of research beyond the abundance of technologies in commercial designs. Some of them are presented in this section.&lt;br /&gt;
&lt;br /&gt;
=== Optical on-chip interconnects ===&lt;br /&gt;
IBM has been performing extensive research on a photonic layer inside the CMP, used not only for connecting several cores but also for routing traffic: [http://researcher.ibm.com/view_project.php?id=2757 Silicon Integrated Nanophotonics.] This technology was actually used in the IBM Cell chip that was mentioned in the sections above. The main advantages are reliability and power efficiency.&lt;br /&gt;
&lt;br /&gt;
This [http://www.research.ibm.com/photonics/publications/ecoc_tutorial_2008.pdf tutorial] explains some differences between electronics and photonics in terms of power consumption. The more power-efficient the computing, the more FLOPs per watt:&lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Electronics'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Photonics'''&lt;br /&gt;
|-&lt;br /&gt;
| Electronic network ~500W||Optic network &amp;lt;80W&lt;br /&gt;
|-&lt;br /&gt;
| power = bandwidth x length||power does not depend on bitrate nor length&lt;br /&gt;
|-&lt;br /&gt;
| buffer on chip that rx and re-tx every bit at every switch||rx (modulate) data once, without having to re-tx&lt;br /&gt;
|-&lt;br /&gt;
| ||switching fabric has almost no power dissipation&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In academia, there are articles such as [[#References|[6]]], which proposes a new topology created for optical on-chip interconnects. The authors refer to previous papers that cite adaptations of well-known electronic designs, but highlight the need to provide a &amp;quot;scalable all-optical NoC, referred to as 2D-HERT, with passive routing of optical data streams based on their wavelengths.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Reconfigurable NoC ===&lt;br /&gt;
Another field of study is software-reconfigurable on-chip networks. They are commonly based on the 2D mesh topology. The main idea is to be able to reconfigure the NoC depending on the application, and during run-time, to react to congestion problems or, in general, adapt to the traffic load. &lt;br /&gt;
&lt;br /&gt;
In [[#References|[12]]], the authors propose a design based on the properties of the  [http://en.wikipedia.org/wiki/Field-programmable_gate_array field-programmable gate array (FPGA)]. It can dynamically implement circuit-switching channels, perform variations in the topology, and reconfigure routing tables. One of the main drawbacks is the overhead that this reconfiguration introduces, although it is designed to minimize it.&lt;br /&gt;
&lt;br /&gt;
=== Bio NoC ===&lt;br /&gt;
Bio NoC or ANoC (Autonomic Network-on-Chip) is based on the concept of the human autonomic nervous system or the human biological immune system. The intention is to provide a NoC with self-organization, self-configuration, and self-healing to dynamically control networking functions. &lt;br /&gt;
&lt;br /&gt;
[[#References|[13]]] presents a collection of chapters/articles from emerging research issues in the ANoC field of application.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
 &lt;br /&gt;
[1] Mirza-Aghatabar, M.; Koohi, S.; Hessabi, S.; Pedram, M. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4341445 &amp;quot;An Empirical Investigation of Mesh and Torus NoC Topologies Under Different Routing Algorithms and Traffic Models,&amp;quot;] 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007), pp. 19-26, 29-31 Aug. 2007&lt;br /&gt;
&lt;br /&gt;
[2] Ying Ping Zhang; Taikyeong Jeong; Fei Chen; Haiping Wu; Nitzsche, R.; Gao, G.R. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1639301 &amp;quot;A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture,&amp;quot;] 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 10 pp., 25-29 April 2006&lt;br /&gt;
&lt;br /&gt;
[3] David Wentzlaff, Patrick Griffin, Henry Hoffmann, Liewei Bao, Bruce Edwards, Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John F. Brown III, and Anant Agarwal. 2007. [http://www.eecg.toronto.edu/~enright/tilera.pdf On-Chip Interconnection Architecture of the Tile Processor.] IEEE Micro 27, 5 (September 2007), 15-31.&lt;br /&gt;
&lt;br /&gt;
[4] D. N. Jayasimha, B. Zafar, Y. Hoskote. [http://blogs.intel.com/wp-content/mt-content/com/research/terascale/ODI_why-different.pdf On-chip interconnection networks: why they are different and how to compare them.] Technical Report, Intel Corp, 2006&lt;br /&gt;
&lt;br /&gt;
[5] John Kim, James Balfour, and William Dally. [http://cva.stanford.edu/publications/2007/MICRO_FBFLY.pdf Flattened butterfly topology for on-chip networks.] In Proceedings of the 40th International Symposium on Microarchitecture, pages 172–182, December 2007.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
[1] B. Grot and S. W. Keckler. [http://www.cs.utexas.edu/~bgrot/docs/CMP-MSI_08.pdf Scalable on-chip interconnect topologies.] 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects, 2008.&lt;br /&gt;
&lt;br /&gt;
[2] Natalie Enright Jerger and Li-Shiuan Peh. [http://www.morganclaypool.com/doi/abs/10.2200/S00209ED1V01Y200907CAC008?journalCode=cac On-Chip Networks.] Synthesis Lectures on Computer Architecture. 2009, 141 pages. Morgan and Claypool Publishers.&lt;br /&gt;
&lt;br /&gt;
[3] Yan Solihin. (2008). [http://www.cesr.ncsu.edu/solihin/Main.html Fundamentals of parallel computer architecture.] Solihin Pub.&lt;br /&gt;
&lt;br /&gt;
[4] James Balfour and William J. Dally. 2006. [http://www.cs.berkeley.edu.prox.lib.ncsu.edu/~kubitron/courses/cs258-S08/handouts/papers/jbalfour_ICS.pdf Design tradeoffs for tiled CMP on-chip networks.] In Proceedings of the 20th annual international conference on Supercomputing (ICS '06). ACM, New York, NY, USA, 187-198.&lt;br /&gt;
&lt;br /&gt;
[5] Dubois, F.; Cano, J.; Coppola, M.; Flich, J.; Petrot, F.; , [http://www.comcas.eu/publications/Spidergon_STNoC_Design.pdf Spidergon STNoC design flow,] Networks on Chip (NoCS), 2011 Fifth IEEE/ACM International Symposium on , vol., no., pp.267-268, 1-4 May 2011&lt;br /&gt;
&lt;br /&gt;
[6] Koohi, S.; Abdollahi, M.; Hessabi, S.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=5948588&amp;amp;isnumber=5948548 All-optical wavelength-routed NoC based on a novel hierarchical topology,] Networks on Chip (NoCS), 2011 Fifth IEEE/ACM International Symposium on , vol., no., pp.97-104, 1-4 May 2011&lt;br /&gt;
&lt;br /&gt;
[7] Flich, J.; Duato, J.;, [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=4407676&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4407676 Logic-Based Distributed Routing for NoCs,] 2008 Computer Architecture Letters, vol. 7, no. 1, pp.13-16, Jan 2008&lt;br /&gt;
&lt;br /&gt;
[8] Wu, J.; [http://www.cse.fau.edu/~jie/research/publications/Publication_files/ieeetc0309.pdf A Fault-Tolerant and Deadlock-Free Routing Protocol in 2D Meshes Based on Odd-Even Turn Model,] 2003 IEEE Transactions on Computers, Vol. 52, No. 9, pp.1154-1169, Sept 2003&lt;br /&gt;
&lt;br /&gt;
[9] Veselovsky, G.; Batovski, D.A.; [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=1183584&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1183584 A study of the permutation capability of a binary hypercube under deterministic dimension-order routing,] 2003 Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on, vol., no., pp.173-177, 5-7 Feb. 2003&lt;br /&gt;
&lt;br /&gt;
[10] Moscibroda, T; Mutlu, O.; [http://research.microsoft.com/pubs/80241/isca_2009-bless.pdf A Case for Bufferless Routing in On-Chip Networks,] ACM SIGARCH Computer Architecture News, Volume 37 Issue 3, June 2009&lt;br /&gt;
&lt;br /&gt;
[11] Fallin, C.; Craik, C.; Mutlu, O.; [http://www.ece.cmu.edu/~safari/pubs/chipper_hpca2011.pdf CHIPPER: A Low-complexity Bufferless Deflection Router,] Proceedings of the 17th IEEE International Symposium on High Performance Computer Architecture (HPCA 2011), San Antonio, TX, February 2011.&lt;br /&gt;
&lt;br /&gt;
[12] V. Rana, et al., [http://infoscience.epfl.ch/record/130661/files/paperM2B-VLSI-SoC2008%5b1%5d.pdf A Reconfigurable Network-on-Chip Architecture for Optimal Multi-Processor SoC Communication,] in VLSI-SoC, 2009.&lt;br /&gt;
&lt;br /&gt;
[13] Cong-Vinh, P. (December 2011). [http://www.crcpress.com/product/isbn/9781439829110 Autonomic networking-on-chip: Bio-inspired specification, development, and verification.] CRC Press.&lt;br /&gt;
&lt;br /&gt;
[14] S. Kumar, et al., [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=1016885&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1016885 A Network on Chip Architecture and Design Methodology,] Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2002.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
1. Advantage of 2-D Mesh&lt;br /&gt;
&lt;br /&gt;
a) simple design&lt;br /&gt;
&lt;br /&gt;
b) cumbersome design&lt;br /&gt;
&lt;br /&gt;
c) degree is the same for all nodes&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
2. Diameter is&lt;br /&gt;
&lt;br /&gt;
a) minimum hop count&lt;br /&gt;
&lt;br /&gt;
b) maximum hop count&lt;br /&gt;
&lt;br /&gt;
c) number of neighbors &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
3. SOC stands for&lt;br /&gt;
&lt;br /&gt;
a) System of Chips&lt;br /&gt;
&lt;br /&gt;
b) Switch of Cores&lt;br /&gt;
&lt;br /&gt;
c) System on a Chip&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4. In a direct topology,&lt;br /&gt;
&lt;br /&gt;
a) each node contains a network interface acting as a router in order to transfer information&lt;br /&gt;
&lt;br /&gt;
b) there are nodes that act as routers&lt;br /&gt;
&lt;br /&gt;
c) only one node is a computational node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5. The Single-Chip Cloud Computer contains &lt;br /&gt;
&lt;br /&gt;
a) an 8x10 mesh&lt;br /&gt;
&lt;br /&gt;
b) a 64-router mesh network&lt;br /&gt;
&lt;br /&gt;
c) a 24-router mesh network&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
6. A deterministic routing scheme uses algorithms to determine the most advantageous path to the target node.&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
7. Livelock is necessary to maintain coherence in routing protocols.&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
8. Dimension Order routing&lt;br /&gt;
&lt;br /&gt;
a) is only possible with 2D mesh-based topologies.&lt;br /&gt;
&lt;br /&gt;
b) attempts to route all packets in one dimension before starting another.&lt;br /&gt;
&lt;br /&gt;
c) uses routing tables to find the packet destination.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
9. Source routing&lt;br /&gt;
&lt;br /&gt;
a) includes information in the packet about the destination node&lt;br /&gt;
&lt;br /&gt;
b) uses routing information calculated by the sending node&lt;br /&gt;
&lt;br /&gt;
c) all of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
10. Store and forward routing&lt;br /&gt;
&lt;br /&gt;
a) requires the entire message to be broken into regular sized pieces and sent over the network&lt;br /&gt;
&lt;br /&gt;
b) is an optimal routing protocol&lt;br /&gt;
&lt;br /&gt;
c) buffers the entire message in each node along the route before sending it to the next node&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/12b_sl&amp;diff=74918</id>
		<title>CSC/ECE 506 Spring 2013/12b sl</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/12b_sl&amp;diff=74918"/>
		<updated>2013-04-17T18:39:20Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;On-chip interconnects&lt;br /&gt;
__TOC__ &lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The current trend in microprocessor design has shifted from extracting ever-increasing performance gains from single-core architectures to leveraging the power of multiple cores per die.  This creates new challenges not present in single-core systems.  A multi-core processor must have a method of passing information between processing cores that is efficient in terms of power consumed, die area used, and the speed at which messages are delivered.  As physical wire widths decrease and the number of wires increases, the gap between gate delay and wire delay widens.[[#References|[14]]]  To address these challenges, much research has been done in the area of on-chip networks.&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
On-chip interconnects are a natural extension of the high integration levels now reached in multiprocessor design. Moore's law predicted that the number of transistors in an integrated circuit doubles roughly every two years. This prediction has driven the integration of on-chip components and continues to guide the semiconductor industry.&lt;br /&gt;
[[File:Itr MIC image 920x460.png|thumb|c|right|Intel® MIC]]&lt;br /&gt;
In recent years, the main players in the chip industry have been racing to integrate more cores on a chip, with multi-core (more than one core) and many-core (so many cores that the historical multi-core techniques are no longer efficient) technologies. This integration is known as [http://en.wikipedia.org/wiki/Multi-core_(computing) CMP] (chip multiprocessor), and more recently Intel has coined the term [http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html Intel® Many Integrated Core (Intel® MIC)].&lt;br /&gt;
&lt;br /&gt;
The traditional off-chip network has proved to have limited applicability for communication among the many cores inside a single chip. According to [[#References|[2]]], off-chip designs suffer from I/O bottlenecks, which are a lesser problem for on-chip technologies because the internal wiring provides much higher bandwidth and avoids the delay associated with external traffic. Nevertheless, on-chip designs still face challenges that need further study, among them power consumption and space constraints.&lt;br /&gt;
&lt;br /&gt;
=== Terminology ===&lt;br /&gt;
Some common terms:&lt;br /&gt;
* [http://en.wikipedia.org/wiki/System_on_a_chip SoCs] (Systems-on-a-chip), which commonly refer to chips that are made for a specific application or domain area.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/MPSoC MPSoCs] (Multiprocessor systems-on-chip), referring to a SoC that uses multi-core technology.&lt;br /&gt;
It is interesting to note that for the particular theme of this article, there are at least three different acronyms referring to the same term. These are new technologies and different researchers have adopted different nomenclature. The acronyms are:&lt;br /&gt;
* NoC (network-on-chip), this is the most common term and also used in this article&lt;br /&gt;
* OCIN (on-chip interconnection network) &lt;br /&gt;
* OCN (on-chip network)&lt;br /&gt;
&lt;br /&gt;
== Topologies ==&lt;br /&gt;
Topology refers to the layout or arrangement of interconnections among the processing elements. In general, a good topology aims to minimize network latency and maximize throughput.&lt;br /&gt;
Certain metrics help classify and compare the different topology types. Some of them are defined in chapter 12 of Solihin's textbook [[#References|[3]]].&lt;br /&gt;
&lt;br /&gt;
*'''Degree''' is the number of neighbors of a node, that is, the nodes that can be reached from it in one hop&lt;br /&gt;
*'''Hop count''' is the number of nodes through which a message must pass to reach its destination&lt;br /&gt;
*'''Diameter''' is the maximum hop count&lt;br /&gt;
*'''Path diversity''' is useful for the routing algorithm and is given by the number of shortest paths that a topology offers between two nodes.&lt;br /&gt;
*'''Bisection width''' is the smallest number of wires you have to cut to separate the network into two halves&lt;br /&gt;
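&lt;br /&gt;
As a rough illustration of these metrics, the sketch below computes the degree and the diameter of a k x k 2D mesh by breadth-first search. It is only a sketch: the function names and the (x, y) node coordinates are assumptions made for this example, and one hop is counted per link traversed.&lt;br /&gt;

```python
from collections import deque


def mesh_neighbors(node, k):
    """Return the neighbors of (x, y) in a k x k 2D mesh (no wraparound links)."""
    x, y = node
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(cx, cy) for cx, cy in candidates if cx in range(k) and cy in range(k)]


def hop_counts_from(source, k):
    """Breadth-first search: minimum hop count from source to every node."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in mesh_neighbors(node, k):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


def mesh_degree(node, k):
    """Degree: the number of nodes reachable from this node in one hop."""
    return len(mesh_neighbors(node, k))


def mesh_diameter(k):
    """Diameter: the largest minimum hop count over all pairs of nodes."""
    nodes = [(x, y) for x in range(k) for y in range(k)]
    return max(max(hop_counts_from(node, k).values()) for node in nodes)
```

For a 4x4 mesh this gives a diameter of 6 (corner to opposite corner), a degree of 2 at the corners, and a degree of 4 for interior nodes.&lt;br /&gt;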
&lt;br /&gt;
&lt;br /&gt;
Topologies can be classified as direct and indirect topologies.&lt;br /&gt;
In a direct topology, each node is connected to other nodes, which are named neighbouring nodes. Each node contains a network interface acting as a router in order to transfer information.&lt;br /&gt;
In an indirect topology, there are nodes that are not computational but act as switches to transfer the traffic among the rest of the nodes, including other switches. It is called indirect because packets are switched through specific elements that are not part of the computational nodes themselves.&lt;br /&gt;
&lt;br /&gt;
An example of a direct topology is the 2-D mesh. An example of an indirect topology is the butterfly; the flattened butterfly described below is derived from it and behaves as a direct topology.  &lt;br /&gt;
&lt;br /&gt;
There are many other topologies that could be introduced in this section. Topologies not covered here include, but are not limited to:&lt;br /&gt;
&lt;br /&gt;
* Hypercube&lt;br /&gt;
* Shuffle-exchange&lt;br /&gt;
* Torus&lt;br /&gt;
* Trees&lt;br /&gt;
&lt;br /&gt;
They are listed here only for completeness; related information can be found at [http://www.cs.cf.ac.uk/Parallel/Year2/section5.html Interconnection Networks].&lt;br /&gt;
&lt;br /&gt;
===Rings===&lt;br /&gt;
[http://en.wikipedia.org/wiki/Ring_network Ring topologies] can be effective when the “number of cores is still relatively small but is larger than what can be supported using a bus” [Solihin 409]. Such cases are considered to use “medium-scale” interconnection networks.&lt;br /&gt;
&lt;br /&gt;
=== 2-D Mesh ===&lt;br /&gt;
[[File:Mesh.png|thumb|c|right|upright=0.75|2D Mesh]]This has been a very popular topology due to its simple design and low layout and router complexity. It is often described as a k-ary n-cube, where k is the number of nodes in each dimension and n is the number of dimensions. For example, a 4-ary 2-cube is a 4x4 2D mesh.&lt;br /&gt;
Another advantage is that this topology is similar to the physical die layout, making it well suited to tiled architectures. For reference, the combination of a switch and a processor is called a ''tile''.&lt;br /&gt;
&lt;br /&gt;
The topology also has drawbacks. One of them is that the degree of the nodes along the edges is lower than the degree of the central nodes. This makes the 2D mesh asymmetrical along the edges; from the networking perspective, there is therefore less demand for edge channels than for central channels.&lt;br /&gt;
&lt;br /&gt;
Jerger and Peh [[#References|[2]]] provide the following parameters for a mesh defined as a k-ary n-cube:&lt;br /&gt;
*the switch degree for a 2D mesh is 4, since the network requires two channels in each dimension (2n), although some ports on the edge will be unused.&lt;br /&gt;
*average minimum hop count: &lt;br /&gt;
:{| {{table}}&lt;br /&gt;
| nk/3|| ||k even&lt;br /&gt;
|-&lt;br /&gt;
| n(k/3 - 1/(3k))|| ||k odd&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
*the channel load across the bisection of a mesh under uniform random traffic with an even k is k/4&lt;br /&gt;
*meshes provide diversity of paths for routing messages&lt;br /&gt;
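&lt;br /&gt;
The average minimum hop count can be checked numerically. The sketch below enumerates every ordered pair of nodes in a k-ary n-dimensional mesh and compares the result with the expression n(k/3 - 1/(3k)). The averaging convention (all ordered pairs, including a node paired with itself) and the function names are assumptions made for this example; under that convention the expression is reproduced exactly, and it differs from nk/3 only by the small term n/(3k).&lt;br /&gt;

```python
from fractions import Fraction
from itertools import product


def average_hops(k, n):
    """Average minimum hop count over all ordered node pairs of a k-ary n-dimensional mesh."""
    nodes = list(product(range(k), repeat=n))
    total = sum(
        sum(abs(a - b) for a, b in zip(src, dst))  # hops needed in each dimension
        for src in nodes
        for dst in nodes
    )
    return Fraction(total, len(nodes) ** 2)


def closed_form(k, n):
    """The expression n(k/3 - 1/(3k)) from the table above, as an exact fraction."""
    return n * (Fraction(k, 3) - Fraction(1, 3 * k))
```

For example, both functions return 16/9 for a 3x3 mesh (k = 3, n = 2).&lt;br /&gt;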
&lt;br /&gt;
=== Concentration Mesh ===&lt;br /&gt;
[[File:Concentratedmesh.png|thumb|c|right|upright=0.75|Concentration Mesh]] This is an evolution of the mesh topology. There is no real need for a 1:1 relationship between the number of cores and the number of switches/routers. The concentration mesh changes the ratio to 4:1, i.e. each router serves four computing nodes. &lt;br /&gt;
&lt;br /&gt;
The advantage over the simple mesh is the decrease in the average hop count, which is important for scaling. However, it is not as scalable as it might seem, because the degree of concentration is limited by crossbar complexity [[#References|[1]]].&lt;br /&gt;
&lt;br /&gt;
Concentration also lowers the bisection channel count, but this can be offset by introducing express channels, as demonstrated in [[#References|[4]]].&lt;br /&gt;
&lt;br /&gt;
Another drawback is that the port bandwidth can become a bottleneck in periods of high traffic.&lt;br /&gt;
&lt;br /&gt;
=== Flattened Butterfly ===&lt;br /&gt;
[[File:Flbfly.png|thumb|c|right|upright=0.75|Flattened butterfly]]A butterfly topology is often described as a k-ary n-fly, which implies k&amp;lt;sup&amp;gt;n&amp;lt;/sup&amp;gt; network nodes with n stages of k&amp;lt;sup&amp;gt;n−1&amp;lt;/sup&amp;gt; k × k intermediate routing nodes. The degree of each intermediate router is 2k.  &lt;br /&gt;
The flattened butterfly is made by flattening (i.e. combining) the routers in each row of a butterfly topology while preserving the inter-router connections. It can use non-minimal routing to improve load balancing in the network.&lt;br /&gt;
Its advantages are that the maximum distance between nodes is two hops and that it has lower latency and better throughput than the mesh topology.&lt;br /&gt;
Its disadvantages are a high channel count (k&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;/2 per row/column), low channel utilization, and increased control complexity.&lt;br /&gt;
The flattened butterfly offers the benefits of a tree (fewer constraints on root-level bandwidth [Solihin 367]) as well as the ability to actually be mapped to a substrate, but because of node concentration&amp;lt;ref name=&amp;quot;GrotKeckler&amp;quot;/&amp;gt; the number of channels required for high scalability is cost- and validation-prohibitive.&lt;br /&gt;
&lt;br /&gt;
===Crossbar Switch===&lt;br /&gt;
A crossbar switch topology uses a bus arrangement in which the bus lines are physically perpendicular to each other and their intersections are connected or disconnected by a switch. In the case of [http://en.wikipedia.org/wiki/Multi-core_(computing) CMPs], this switch is a transistor or, depending on the desired characteristics of the system, a programmable fuse. Due to their ability to be [http://en.wikipedia.org/wiki/Multistage_interconnection_networks multi-staged]&amp;lt;ref name=&amp;quot;wikicrossbarsemi&amp;quot;&amp;gt;&amp;quot;[http://en.wikipedia.org/wiki/Crossbar_switch#Semiconductor Crossbar switch].&amp;quot; Wikipedia. Last accessed April 24, 2012.&amp;lt;/ref&amp;gt;, these topologies lend themselves to being used for memory in large-scale systems. The IBM Cyclops64 architecture is an example of an implementation of this topology&amp;lt;ref name=&amp;quot;cyclops64&amp;quot;&amp;gt;Zhang, Ying Ping. &amp;quot;[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1639301 A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture].&amp;quot; April 2006. IEEE Xplore.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Multidrop Express Channels (MECS) ===&lt;br /&gt;
[[File:Mecs.png|thumb|c|right|upright=0.75|MECS]] Multidrop Express Channels was proposed in [[#References|[1]]] by Grot and Keckler. Their motivation was that performance and scalability should be obtained by managing wiring. &lt;br /&gt;
Multidrop Express Channels is defined by its authors as a &amp;quot;one-to-many communication fabric that enables a high degree of connectivity in a bandwidth-efficient manner.&amp;quot;  It is based on point-to-point unidirectional links, which provides a high degree of connectivity with fewer bisection channels and higher bandwidth per channel. &lt;br /&gt;
&lt;br /&gt;
Some of the parameters calculated for MECS are:&lt;br /&gt;
*Bisection channel count per row/column is equal to k.&lt;br /&gt;
*Network diameter (maximum hop count) is two.&lt;br /&gt;
*The number of nodes accessible through each channel ranges from 1 to k − 1.&lt;br /&gt;
*A node has 1 output port per direction&lt;br /&gt;
*The input port count is 2(k − 1)&lt;br /&gt;
&lt;br /&gt;
The low channel count and the high degree of connectivity provided by each channel increase per channel bandwidth and wire utilization. At the same time, the design minimizes the serialization delay. It presents low network latencies due to its low diameter.&lt;br /&gt;
&lt;br /&gt;
=== Comparison of topologies ===&lt;br /&gt;
This data is taken from the analysis done in [[#References|[1]]]. &lt;br /&gt;
&lt;br /&gt;
[[File:Topologycomp.png|thumbnail|center|upright=5|Comparison of CMesh, Flattened Butterfly, and MECS]]&lt;br /&gt;
&lt;br /&gt;
The table compares three of the topologies described above for two combinations of k, the network radix (nodes per dimension), and c, the concentration factor (1 meaning no concentration). &lt;br /&gt;
&lt;br /&gt;
The maximum hop count is 2 for the flattened butterfly and MECS, whereas it is directly proportional to k for the concentrated mesh, which makes the flattened butterfly and MECS the better solutions in terms of network latency.&lt;br /&gt;
&lt;br /&gt;
The number of bisection channels is 1 for CMesh in both cases, whereas it doubles and even quadruples between MECS and the flattened butterfly. &lt;br /&gt;
&lt;br /&gt;
The bandwidth per channel in this example is better for CMesh and MECS and lower for the flattened butterfly.&lt;br /&gt;
&lt;br /&gt;
=== Examples of topologies in current NoCs ===&lt;br /&gt;
&lt;br /&gt;
==== Intel ====&lt;br /&gt;
The [http://techresearch.intel.com/ProjectDetails.aspx?Id=151 Intel Teraflops Research Chip] consists of an 8x10 mesh with two 38-bit unidirectional links per channel. It has a bisection bandwidth of 380 GB/s, which includes data and sideband communication. There is a 5-port router inside each of the computing nodes, and communication is carried out through message passing. Its name comes from its performance of one trillion mathematical calculations per second (1 teraflops), achieved with 80 simple cores, each containing 2 floating-point units, while consuming only 62 watts (less than many other processors).&lt;br /&gt;
 &lt;br /&gt;
The [http://techresearch.intel.com/ProjectDetails.aspx?Id=1 Single-Chip Cloud Computer] contains a 24-router mesh network with 256 GB/s bisection bandwidth. This design contains 48 fully functional cores and consumes only 25 watts. This newer model is more complete than the Teraflops Research Chip. It is fully programmable and is used for research by academia and private companies.&lt;br /&gt;
&lt;br /&gt;
==== Tilera ====&lt;br /&gt;
Tilera, maker of the [http://www.tilera.com/products/processors TileGx, TilePro, and Tile64] processors, is a fabless semiconductor company that has developed a &amp;quot;tile processor&amp;quot; in which the fabrication of the multi-processor device is greatly simplified by the placement of processor &amp;quot;tiles&amp;quot; on the die. The technology behind this innovation is iMesh, the on-chip interconnection technology used in the Tile Processor's architecture&amp;lt;ref name=&amp;quot;Tilera&amp;quot;&amp;gt;&amp;quot;On-Chip Interconnection Architecture of the Tile Processor,&amp;quot; Wentzlaff, et al. 2007. IEEE Xplore.&amp;lt;/ref&amp;gt;.  The iMesh™ consists of five independent 8x8 mesh networks with two 32-bit unidirectional links per channel. The Tile Processor is innovative due to its highly scalable implementation of an on-chip network that uses 2D meshes. These are physically organized (as opposed to logically organized) due to design considerations when scaling and laying out new designs. It provides a bisection bandwidth of 320 GB/s.&lt;br /&gt;
The tiles that make up the Tilera designs each contain a complete processor with L1 and L2 caches. Each tile can run an operating system independently, or several tiles together can run a single operating system such as SMP Linux.&lt;br /&gt;
&lt;br /&gt;
==== ST Microelectronics ====&lt;br /&gt;
[[File:Spidergon.png|thumb|c|right|upright=1.5|Example of Spidergon design]]&lt;br /&gt;
ST Microelectronics created the Spidergon design for the STNoC [[#References|[5]]]. &lt;br /&gt;
&lt;br /&gt;
The Spidergon is a pseudo-regular topology with a design that is composed of three building blocks: network interface, router, and physical link. These building blocks make the design ready to be tailored to the needs of the application. Each router building block has a degree of 3.&lt;br /&gt;
&lt;br /&gt;
The 3 building blocks can be used to create the specific design needed, with the input/output ports that the application requires. The blocks can be configured and stored in a library for creating the design. In the picture on the right, the example contains 2 of the building blocks (router and network interface) and a third undisclosed block.&lt;br /&gt;
&lt;br /&gt;
==== IBM ====&lt;br /&gt;
The IBM [http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.html Cell] project uses an interconnect with four unidirectional 16B-wide data rings, two in each direction. The interconnect is called the Element Interconnect Bus (EIB) and allows communication among the different components of the Cell as well as with external I/O. The total network bisection bandwidth is 307.2 GB/s. &lt;br /&gt;
&lt;br /&gt;
As a curiosity, the Cell processor was jointly developed with Sony and Toshiba, and is [http://en.wikipedia.org/wiki/Cell_(microprocessor) used] in the [http://news.cnet.com/PlayStation-3-chip-has-split-personality/2100-1043_3-5566340.html?tag=nl Sony PlayStation 3]. The Cell consists of a PowerPC core which manages eight synergistic processing engines (SPEs) that can be used for floating-point calculations. These calculations provide the engine for better gaming systems.&lt;br /&gt;
&lt;br /&gt;
===ARM CoreLink Interconnect===&lt;br /&gt;
The ARM CoreLink Interconnect is a highly flexible and configurable interconnection network specification that implements the [http://en.wikipedia.org/wiki/Advanced_Microcontroller_Bus_Architecture AMBA] (Advanced Microcontroller Bus Architecture) protocol. The AMBA protocol is &amp;quot;an open standard, on-chip interconnect specification for the connection and management of functional blocks in a System-on-Chip (SoC). It enables development of multi-processor designs with large numbers of controllers and peripherals.&amp;quot;&amp;lt;ref name=&amp;quot;ambadoc&amp;quot;&amp;gt;[http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.set.amba/index.html AMBA] on the ARM Info Center.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Routing ==&lt;br /&gt;
&lt;br /&gt;
A variety of routing protocols can be used for [http://en.wikipedia.org/wiki/System_on_a_chip SoCs], each with different advantages and disadvantages.  They can be broadly classified in several different ways.&lt;br /&gt;
&lt;br /&gt;
===General Routing Schemes===&lt;br /&gt;
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Store_and_forward Store and forward routing]==== &lt;br /&gt;
This routing scheme has been used since the early days of telecommunications.  It requires that the entire message be received at a node before it is propagated to the next node.  This protocol suffers from a high storage requirement and high latency, due to the need to completely buffer a message before forwarding it.[[#References|[7]]]  This approach can be quite effective when the average packet size is small in comparison with the channel widths.&lt;br /&gt;
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Cut-through_switching Cut-Through routing] or [http://en.wikipedia.org/wiki/Wormhole_switching Worm Hole routing]====&lt;br /&gt;
These two protocols use the switch to examine the flit header, decide where to send the message, and then start forwarding it immediately.  True cut-through routing lets the tail continue when the head is blocked, collecting the whole packet in a single switch (which requires a buffer large enough to hold the largest packet).  In worm hole routing, when the head of the message is blocked the message stays strung out over multiple nodes in the network, potentially blocking other messages (however, this needs only enough buffer space to store the piece of the packet that is sent between switches).  Using a cut-through protocol lowers latency but can suffer from packet corruption and must implement a scheme to handle this.[[#References|[7]]]&lt;br /&gt;
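&lt;br /&gt;
The latency difference between store and forward and cut-through can be made concrete with a first-order, contention-free model. This is a textbook-style approximation rather than a result from the cited sources; the function and parameter names are assumptions made for this example, with time measured in cycles.&lt;br /&gt;

```python
def store_and_forward_latency(hops, router_delay, packet_bits, link_bits_per_cycle):
    """Each hop waits for the whole packet before forwarding it."""
    serialization = packet_bits / link_bits_per_cycle  # cycles to push one packet over a link
    return hops * (router_delay + serialization)


def cut_through_latency(hops, router_delay, packet_bits, link_bits_per_cycle):
    """The header is forwarded at once, so serialization is paid only one time."""
    serialization = packet_bits / link_bits_per_cycle
    return hops * router_delay + serialization
```

With 4 hops, a 1-cycle router delay, a 128-bit packet, and 32-bit links, this model gives 20 cycles for store and forward and 8 cycles for cut-through.&lt;br /&gt;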
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Deterministic_routing Deterministic routing]====&lt;br /&gt;
This describes a routing scheme where, if we are given a pair of nodes, the same path will always be used between those nodes.&lt;br /&gt;
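&lt;br /&gt;
A common deterministic scheme on a 2D mesh is dimension-order (XY) routing, in which a packet first travels along the X dimension and then along the Y dimension. The sketch below is a minimal illustration; the function name and the (x, y) coordinate convention are assumptions made for this example.&lt;br /&gt;

```python
def xy_route(src, dst):
    """Return the list of nodes visited from src to dst, X dimension first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += (dst[0] - x) // abs(dst[0] - x)  # step of +1 or -1 toward the destination column
        path.append((x, y))
    while y != dst[1]:
        y += (dst[1] - y) // abs(dst[1] - y)  # step of +1 or -1 toward the destination row
        path.append((x, y))
    return path
```

Because the path depends only on the source and the destination, calling the function twice with the same pair of nodes always yields the same route, which is the defining property of deterministic routing.&lt;br /&gt;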
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Adaptive_routing Adaptive routing]====&lt;br /&gt;
This is a routing scheme where the underlying routers may alter the path of packet flow in response to system conditions or other algorithm criteria.  Adaptive routing is intended to provide as many routes as possible to reach the destination.&lt;br /&gt;
&lt;br /&gt;
====Deadlock and Livelock====&lt;br /&gt;
&lt;br /&gt;
Deadlock and livelock are two separate situations that may occur during routing, both resulting in packets never reaching their destination.  They are defined as follows:&lt;br /&gt;
&lt;br /&gt;
''' Deadlock ''' is defined as a situation where there are activities (e.g., messages) each waiting for another to finish something.[[#References|[8]]] Since a waiting activity cannot finish, the messages are deadlocked.  This is analogous to the [http://en.wikipedia.org/wiki/Dining_philosophers_problem Dining Philosophers Problem]: each deadlocked message is waiting on the result of another deadlocked message, and none are able to reach their destination.&lt;br /&gt;
&lt;br /&gt;
''' Livelock ''' is defined as a situation where a message can move from node to node but will never reach its destination node.[[#References|[8]]]  This is similar to deadlock in that the message never reaches its destination, but the message is still able to travel through portions of the network, making hops but never reaching its target.  This is analogous to a process spinning while waiting: the process is still active, but it is doing meaningless work.  &lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
===Routing Protocols in SoCs===&lt;br /&gt;
&lt;br /&gt;
The specific routing protocols below are built using the ideas from the classes of protocols previously described.&lt;br /&gt;
&lt;br /&gt;
==== Source Routing ====&lt;br /&gt;
&lt;br /&gt;
The source node partially or totally computes the path a packet will take through the network and stores the information in the packet header.  The extra route information is sent in each packet, inflating their size.&lt;br /&gt;
&lt;br /&gt;
==== Distributed Routing ====&lt;br /&gt;
&lt;br /&gt;
Each switch in the network computes the next hop that the packet will take toward the destination.  The packet header contains only the destination information, reducing its size compared to source routing.  This approach requires routing tables to be present to direct the packet from node to node, which does not scale well when the number of nodes increases.&lt;br /&gt;
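&lt;br /&gt;
The difference between source routing and distributed routing can be sketched as follows. This is an illustration only: the port names N, S, E, and W, the dictionary-based packet, and the X-first path choice are assumptions made for this example, not part of any cited design.&lt;br /&gt;

```python
def source_route(src, dst):
    """Computed once at the sender: the full list of output ports (X first, then Y)."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    east_west = ["E" if dx == abs(dx) else "W"] * abs(dx)
    north_south = ["N" if dy == abs(dy) else "S"] * abs(dy)
    return east_west + north_south


def forward_source_routed(packet):
    """Source routing: a switch only pops the next port from the packet header."""
    return packet["route"].pop(0)


def build_table(here, nodes):
    """Distributed routing: one table per switch, with one entry per destination."""
    return {dst: source_route(here, dst)[0] for dst in nodes if dst != here}


def forward_distributed(table, packet):
    """Distributed routing: the header carries only the destination."""
    return table[packet["dest"]]
```

In this sketch the source-routed header grows with the path length, while each distributed routing table holds one entry per other node in the network, which illustrates why tables scale poorly as the node count grows.&lt;br /&gt;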
&lt;br /&gt;
==== Logic Based Distributed Routing (LBDR) ====&lt;br /&gt;
&lt;br /&gt;
In this protocol, each router knows its position in the architecture and can determine in which direction the destination of the packet lies.  It is most commonly used in 2D meshes, but it can be applied to other topologies as well.[[#References|[7]]]  Using this position information, it is possible to route the packet based on a small number of bits and a few logic gates per router, which is cheaper than a routing table or a buffer.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are several variations of LBDR:&lt;br /&gt;
&lt;br /&gt;
''' LBDRe ''' - This variation models up to two future hops before deciding where to send the packet next.    &lt;br /&gt;
&lt;br /&gt;
''' uLBDR (Universal LBDR) ''' - This variation adds packet multicast support to the protocol.&lt;br /&gt;
&lt;br /&gt;
''' bLBDR ''' - This variation adds the ability to broadcast messages to only certain regions (segments) of the network.&lt;br /&gt;
&lt;br /&gt;
==== Bufferless Deflection Routing (BLESS protocol) ====&lt;br /&gt;
&lt;br /&gt;
In this protocol, each flit of a packet is routed independently of every other flit through the network, and different flits from the same packet may take different paths.  Any contention between multiple flits results in one flit taking the desired path and the other flit being “deflected” to some other router.  This may result in undesirable routing, but the packets will eventually reach the destination.[[#References|[10]]]  This type of routing is feasible on every network topology that satisfies the following two constraints: Every router has at least the same number of output ports as the number of its input ports, and every router is reachable from every other router.[[#References|[10]]]  &lt;br /&gt;
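&lt;br /&gt;
The port allocation at a single router can be sketched as follows. This is only an illustration, assuming an oldest-first ranking of flits and hypothetical port names; the function name and data layout are not taken from [10].&lt;br /&gt;
&lt;br /&gt;
```python
def allocate_ports(flits, ports):
    """Give every incoming flit a distinct output port, deflecting the losers.

    flits: list of (age, preferred_port) tuples; a larger age means older.
    ports: list of output port names.
    Returns a dict mapping flit index to the port that flit is sent to.
    """
    # Constraint from the text: at least as many output ports as arriving flits.
    if max(len(flits), len(ports)) != len(ports):
        raise ValueError("need at least as many output ports as flits")
    free = list(ports)
    assignment = {}
    # Assumed ranking: oldest flit chooses first, ties broken by index.
    order = sorted(range(len(flits)), key=lambda i: (-flits[i][0], i))
    for i in order:
        preferred = flits[i][1]
        if preferred in free:
            chosen = preferred  # productive hop toward the destination
        else:
            chosen = free[0]    # contention lost: deflected to a free port
        free.remove(chosen)
        assignment[i] = chosen
    return assignment
```
&lt;br /&gt;
No flit is ever buffered or dropped in this sketch: every flit leaves on some port in the same step, which is why the port-count constraint is needed.&lt;br /&gt;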
&lt;br /&gt;
==== CHIPPER (Cheap-Interconnect Partially Permuting Router) ====&lt;br /&gt;
&lt;br /&gt;
This protocol was designed to address inefficient port allocation in the BLESS protocol.  A permutation network directs deflected flits to free output ports.  In the case of contention, arbitration logic chooses a winning flit, and only the highest-priority flit is guaranteed to obtain the port it requests.  To prevent livelock, a single packet is chosen and prioritized globally above all other packets for long enough that its delivery is ensured.  Every packet in the system eventually receives this special status, so every packet is eventually delivered (the Golden Packet scheme).[[#References|[11]]]&lt;br /&gt;
&lt;br /&gt;
==== Dimension-order Routing ====&lt;br /&gt;
&lt;br /&gt;
This protocol is a deterministic strategy for multidimensional networks.  The dimensions are taken in a fixed order, and a packet is routed completely in one dimension before it switches to the next.  For example, in a 2D mesh, dimension-order routing could be implemented by completely routing the packet in the X-dimension before beginning to route in the Y-dimension.  This extends to higher-dimensional networks as well.  For example, hypercubes can be routed in dimension order by routing packets along the dimensions in the order of the bit positions in which the source and destination addresses differ, one bit position at a time.[[#References|[9]]]&lt;br /&gt;
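&lt;br /&gt;
Both cases can be sketched in a few lines. The function names, the (x, y) coordinate convention, and the choice to resolve hypercube bits from the lowest position upward are assumptions made for this illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def xy_route(src, dst):
    """2D mesh: list of (x, y) nodes visited, finishing X before starting Y."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Phase 1: correct the X coordinate completely.
    while x != dst_x:
        x += (dst_x - x) // abs(dst_x - x)  # step of +1 or -1 toward dst_x
        path.append((x, y))
    # Phase 2: only now start moving in the Y dimension.
    while y != dst_y:
        y += (dst_y - y) // abs(dst_y - y)
        path.append((x, y))
    return path


def hypercube_route(src, dst, dimensions):
    """Hypercube: fix the differing address bits one position at a time."""
    node = src
    path = [node]
    for bit in range(dimensions):
        mask = 2 ** bit
        if (node // mask) % 2 != (dst // mask) % 2:
            node ^= mask  # flip this bit: one hop along dimension `bit`
            path.append(node)
    return path
```
&lt;br /&gt;
Because the order of dimensions is fixed, the same source and destination always produce the same path, which is what makes the scheme deterministic.&lt;br /&gt;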
&lt;br /&gt;
== Lines of Research ==&lt;br /&gt;
From the NoC perspective, there are many lines of research beyond the abundance of technologies used in commercial designs. Some of them are presented in this section.&lt;br /&gt;
&lt;br /&gt;
=== Optical on-chip interconnects ===&lt;br /&gt;
IBM has been performing extensive research on a photonic layer inside the CMP that is used not only for connecting several cores, but also for routing traffic: [http://researcher.ibm.com/view_project.php?id=2757 Silicon Integrated Nanophotonics.] This technology was actually used in the IBM Cell chip that was mentioned in the sections above. The main advantages are reliability and power efficiency.&lt;br /&gt;
&lt;br /&gt;
This [http://www.research.ibm.com/photonics/publications/ecoc_tutorial_2008.pdf tutorial] explains some differences between electronics and photonics in terms of power consumption. The more power-efficient the computation, the more FLOPs per watt can be achieved:&lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Electronics'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Photonics'''&lt;br /&gt;
|-&lt;br /&gt;
| Electronic network ~500W||Optic network &amp;lt;80W&lt;br /&gt;
|-&lt;br /&gt;
| power = bandwidth x length||power does not depend on bitrate nor length&lt;br /&gt;
|-&lt;br /&gt;
| buffer on chip that rx and re-tx every bit at every switch||rx (modulate) data once, without having to re-tx&lt;br /&gt;
|-&lt;br /&gt;
| ||switching fabric has almost no power dissipation&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In academia, there are articles such as [[#References|[6]]], which proposes a new topology created for optical on-chip interconnects. The authors refer to previous papers that adapt well-known electronic designs, but highlight the need to provide a &amp;quot;scalable all-optical NoC, referred to as 2D-HERT, with passive routing of optical data streams based on their wavelengths.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Reconfigurable NoC ===&lt;br /&gt;
Another field of study is software-reconfigurable on-chip networks. They are commonly based on the 2D mesh topology. The main idea is to reconfigure the NoC for the application and at run time, in order to react to congestion problems or, more generally, to adapt to the traffic load. &lt;br /&gt;
&lt;br /&gt;
In [[#References|[12]]], the authors propose a design based on the properties of the [http://en.wikipedia.org/wiki/Field-programmable_gate_array field-programmable gate array (FPGA)]. It can dynamically implement circuit-switching channels, perform variations in the topology, and reconfigure routing tables. One of the main drawbacks is the overhead that this reconfiguration introduces, although the design aims to minimize it.&lt;br /&gt;
&lt;br /&gt;
=== Bio NoC ===&lt;br /&gt;
Bio NoC or ANoC (Autonomic Network-on-Chip) is based on the concept of the human autonomic nervous system or the human biological immune system. The intention is to provide a NoC with self-organization, self-configuration, and self-healing to dynamically control networking functions. &lt;br /&gt;
&lt;br /&gt;
[[#References|[13]]] presents a collection of chapters and articles on emerging research issues in the ANoC field.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
 &lt;br /&gt;
[1] Mirza-Aghatabar, M.; Koohi, S.; Hessabi, S.; Pedram, M., [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4341445 &amp;quot;An Empirical Investigation of Mesh and Torus NoC Topologies Under Different Routing Algorithms and Traffic Models,&amp;quot;] 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007), pp. 19-26, 29-31 Aug. 2007&lt;br /&gt;
&lt;br /&gt;
[2] Ying Ping Zhang; Taikyeong Jeong; Fei Chen; Haiping Wu; Nitzsche, R.; Gao, G.R., [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1639301 &amp;quot;A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture,&amp;quot;] 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 10 pp., 25-29 April 2006&lt;br /&gt;
&lt;br /&gt;
[3] David Wentzlaff, Patrick Griffin, Henry Hoffmann, Liewei Bao, Bruce Edwards, Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John F. Brown III, and Anant Agarwal. 2007. [http://www.eecg.toronto.edu/~enright/tilera.pdf On-Chip Interconnection Architecture of the Tile Processor.] IEEE Micro 27, 5 (September 2007), 15-31.&lt;br /&gt;
&lt;br /&gt;
[4] D. N. Jayasimha, B. Zafar, Y. Hoskote. [http://blogs.intel.com/wp-content/mt-content/com/research/terascale/ODI_why-different.pdf On-chip interconnection networks: why they are different and how to compare them.] Technical Report, Intel Corp, 2006&lt;br /&gt;
&lt;br /&gt;
[5] John Kim, James Balfour, and William Dally. [http://cva.stanford.edu/publications/2007/MICRO_FBFLY.pdf Flattened butterfly topology for on-chip networks.] In Proceedings of the 40th International Symposium on Microarchitecture, pages 172–182, December 2007.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
[1] B. Grot and S. W. Keckler. [http://www.cs.utexas.edu/~bgrot/docs/CMP-MSI_08.pdf Scalable on-chip interconnect topologies.] 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects, 2008.&lt;br /&gt;
&lt;br /&gt;
[2] Natalie Enright Jerger and Li-Shiuan Peh. [http://www.morganclaypool.com/doi/abs/10.2200/S00209ED1V01Y200907CAC008?journalCode=cac On-Chip Networks.] Synthesis Lectures on Computer Architecture. 2009, 141 pages. Morgan and Claypool Publishers.&lt;br /&gt;
&lt;br /&gt;
[3] Yan Solihin. (2008). [http://www.cesr.ncsu.edu/solihin/Main.html Fundamentals of parallel computer architecture.] Solihin Pub.&lt;br /&gt;
&lt;br /&gt;
[4] James Balfour and William J. Dally. 2006. [http://www.cs.berkeley.edu.prox.lib.ncsu.edu/~kubitron/courses/cs258-S08/handouts/papers/jbalfour_ICS.pdf Design tradeoffs for tiled CMP on-chip networks.] In Proceedings of the 20th annual international conference on Supercomputing (ICS '06). ACM, New York, NY, USA, 187-198.&lt;br /&gt;
&lt;br /&gt;
[5] Dubois, F.; Cano, J.; Coppola, M.; Flich, J.; Petrot, F., [http://www.comcas.eu/publications/Spidergon_STNoC_Design.pdf Spidergon STNoC design flow,] Fifth IEEE/ACM International Symposium on Networks on Chip (NoCS), pp. 267-268, 1-4 May 2011&lt;br /&gt;
&lt;br /&gt;
[6] Koohi, S.; Abdollahi, M.; Hessabi, S., [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=5948588&amp;amp;isnumber=5948548 All-optical wavelength-routed NoC based on a novel hierarchical topology,] Fifth IEEE/ACM International Symposium on Networks on Chip (NoCS), pp. 97-104, 1-4 May 2011&lt;br /&gt;
&lt;br /&gt;
[7] Flich, J.; Duato, J., [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=4407676&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4407676 Logic-Based Distributed Routing for NoCs,] IEEE Computer Architecture Letters, vol. 7, no. 1, pp. 13-16, Jan 2008&lt;br /&gt;
&lt;br /&gt;
[8] Wu, J.; [http://www.cse.fau.edu/~jie/research/publications/Publication_files/ieeetc0309.pdf A Fault-Tolerant and Deadlock-Free Routing Protocol in 2D Meshes Based on Odd-Even Turn Model,] 2003 IEEE Transactions on Computers, Vol. 52, No. 9, pp.1154-1169, Sept 2003&lt;br /&gt;
&lt;br /&gt;
[9] Veselovsky, G.; Batovski, D.A., [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=1183584&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1183584 A study of the permutation capability of a binary hypercube under deterministic dimension-order routing,] Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing, pp. 173-177, 5-7 Feb. 2003&lt;br /&gt;
&lt;br /&gt;
[10] Moscibroda, T; Mutlu, O.; [http://research.microsoft.com/pubs/80241/isca_2009-bless.pdf A Case for Bufferless Routing in On-Chip Networks,] ACM SIGARCH Computer Architecture News, Volume 37 Issue 3, June 2009&lt;br /&gt;
&lt;br /&gt;
[11] Fallin, C.; Craik, C.; Mutlu, O.; [http://www.ece.cmu.edu/~safari/pubs/chipper_hpca2011.pdf CHIPPER: A Low-complexity Bufferless Deflection Router,] Proceedings of the 17th IEEE International Symposium on High Performance Computer Architecture (HPCA 2011), San Antonio, TX, February 2011.&lt;br /&gt;
&lt;br /&gt;
[12] V. Rana, et al., [http://infoscience.epfl.ch/record/130661/files/paperM2B-VLSI-SoC2008%5b1%5d.pdf A Reconfigurable Network-on-Chip Architecture for Optimal Multi-Processor SoC Communication,] in VLSI-SoC, 2009.&lt;br /&gt;
&lt;br /&gt;
[13] Cong-Vinh, P. (December 2011). [http://www.crcpress.com/product/isbn/9781439829110 Autonomic networking-on-chip: Bio-inspired specification, development, and verification.] CRC Press.&lt;br /&gt;
&lt;br /&gt;
[14] S. Kumar, et al., [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=1016885&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1016885 A Network on Chip Architecture and Design Methodology,] IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2002.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
1. Advantage of 2-D Mesh&lt;br /&gt;
&lt;br /&gt;
a) simple design&lt;br /&gt;
&lt;br /&gt;
b) cumbersome design&lt;br /&gt;
&lt;br /&gt;
c) degree is the same for all nodes&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
2. Diameter is&lt;br /&gt;
&lt;br /&gt;
a) minimum hop count&lt;br /&gt;
&lt;br /&gt;
b) maximum hop count&lt;br /&gt;
&lt;br /&gt;
c) number of neighbors &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
3. SOC stands for&lt;br /&gt;
&lt;br /&gt;
a) System of Chips&lt;br /&gt;
&lt;br /&gt;
b) Switch of Cores&lt;br /&gt;
&lt;br /&gt;
c) System on a Chip&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4. In a direct topology,&lt;br /&gt;
&lt;br /&gt;
a) each node contains a network interface acting as a router in order to transfer information&lt;br /&gt;
&lt;br /&gt;
b) there are nodes that act as routers&lt;br /&gt;
&lt;br /&gt;
c) only one node is a computational node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5. The Single-Chip Cloud Computer contains &lt;br /&gt;
&lt;br /&gt;
a) an 8x10 mesh&lt;br /&gt;
&lt;br /&gt;
b) a 64-router mesh network&lt;br /&gt;
&lt;br /&gt;
c) a 24-router mesh network&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
6. A deterministic routing scheme uses algorithms to determine the most advantageous path to the target node.&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
7. Livelock is necessary to maintain coherence in routing protocols.&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
8. Dimension Order routing&lt;br /&gt;
&lt;br /&gt;
a) is only possible with 2D mesh-based topologies.&lt;br /&gt;
&lt;br /&gt;
b) attempts to route all packets in one dimension before starting another.&lt;br /&gt;
&lt;br /&gt;
c) uses routing tables to find the packet destination.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
9. Source routing&lt;br /&gt;
&lt;br /&gt;
a) includes information in the packet about the destination node&lt;br /&gt;
&lt;br /&gt;
b) uses routing information calculated by the sending node&lt;br /&gt;
&lt;br /&gt;
c) all of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
10. Store and forward routing&lt;br /&gt;
&lt;br /&gt;
a) requires the entire message to be broken into regular sized pieces and sent over the network&lt;br /&gt;
&lt;br /&gt;
b) is an optimal routing protocol&lt;br /&gt;
&lt;br /&gt;
c) buffers the entire message in each node along the route before sending it to the next node&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/12b_sb&amp;diff=74917</id>
		<title>CSC/ECE 506 Spring 2012/12b sb</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/12b_sb&amp;diff=74917"/>
		<updated>2013-04-17T18:35:12Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;On-chip interconnects&lt;br /&gt;
__TOC__ &lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The current trend in microprocessor design has shifted from extracting ever-increasing performance gains from single-core architectures to leveraging the power of multiple cores per die.  This creates new challenges not present in single-core systems.  A multi-core processor must have a method of passing information between processing cores that is efficient in terms of power consumed, space used on die, and the speed at which messages are delivered.  As physical wire widths decrease and the number of wires increases, the gap between gate delay and wire delay widens.[[#References|[14]]]  To address these challenges, much research has been done in the area of on-chip networks.&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
On-chip interconnects are a natural extension of the high integration levels now reached with multiprocessor integration. Moore's law predicted that the number of transistors in an integrated circuit doubles every two years. This prediction has driven the integration of on-chip components and continues to guide the semiconductor industry.&lt;br /&gt;
[[File:Itr MIC image 920x460.png|thumb|c|right|Intel® MIC]]&lt;br /&gt;
In recent years, the main players in the chip industry keep racing to provide more cores integrated in a chip, with the multi-core (more than one core) and many-core (multi-core with so many cores that the historical multi-core techniques are not efficient any longer) technologies. This integration is known as [http://en.wikipedia.org/wiki/Multi-core_(computing) CMP] (chip multiprocessor) and lately Intel has coined the term [http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html Intel® Many Integrated Core (Intel® MIC)].&lt;br /&gt;
&lt;br /&gt;
Traditional off-chip networks have proved to be of limited use for communication among the many cores inside a single chip. According to [[#References|[2]]], off-chip designs suffer from I/O bottlenecks, which are less of a problem for on-chip technologies because the internal wiring provides much higher bandwidth and avoids the delay associated with external traffic. Nevertheless, on-chip designs still face challenges that need further study, among them power consumption and space constraints.&lt;br /&gt;
&lt;br /&gt;
=== Terminology ===&lt;br /&gt;
Some common terms:&lt;br /&gt;
* [http://en.wikipedia.org/wiki/System_on_a_chip SoCs] (Systems-on-a-chip), which commonly refer to chips that are made for a specific application or domain area.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/MPSoC MPSoCs] (Multiprocessor systems-on-chip), referring to a SoC that uses multi-core technology.&lt;br /&gt;
It is interesting to note that for the particular theme of this article, there are at least three different acronyms referring to the same term. These are new technologies and different researchers have adopted different nomenclature. The acronyms are:&lt;br /&gt;
* NoC (network-on-chip), this is the most common term and also used in this article&lt;br /&gt;
* OCIN (on-chip interconnection network) &lt;br /&gt;
* OCN (on-chip network)&lt;br /&gt;
&lt;br /&gt;
== Topologies ==&lt;br /&gt;
Topology refers to the layout or arrangement of interconnections among the processing elements. In general, a good topology aims to minimize network latency and maximize throughput.&lt;br /&gt;
There are certain metrics that help with the classification and comparison of the different topology types. Some of them are defined in Solihin's [[#References|[3]]] textbook in chapter 12.&lt;br /&gt;
&lt;br /&gt;
*'''Degree''' is the number of neighbors of a node, in other words, the number of nodes that can be reached from it in one hop&lt;br /&gt;
*'''Hop count''' is the number of nodes a message must pass through to get to its destination&lt;br /&gt;
*'''Diameter''' is the maximum hop count&lt;br /&gt;
*'''Path diversity''' is useful for the routing algorithm and is given by the number of shortest paths that a topology offers between two nodes&lt;br /&gt;
*'''Bisection width''' is the smallest number of wires that must be cut to separate the network into two halves&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Topologies can be classified as direct and indirect topologies.&lt;br /&gt;
In a direct topology, each node is connected to other nodes, which are named neighbouring nodes. Each node contains a network interface acting as a router in order to transfer information.&lt;br /&gt;
In an indirect topology, there are nodes that are not computational but act as switches to transfer traffic among the rest of the nodes, including other switches. It is called indirect because packets are switched through specific elements that are not part of the computational nodes themselves.&lt;br /&gt;
&lt;br /&gt;
An example of a direct topology is the 2-D Mesh. An example of an indirect topology is the Flattened Butterfly.  &lt;br /&gt;
&lt;br /&gt;
There are many different topologies that could be introduced in this section. Some of the missing topologies include but are not limited to:&lt;br /&gt;
&lt;br /&gt;
* Hypercube&lt;br /&gt;
* Shuffle-exchange&lt;br /&gt;
* Torus&lt;br /&gt;
* Trees&lt;br /&gt;
&lt;br /&gt;
They are cited here only for completeness; related information can be found at [http://www.cs.cf.ac.uk/Parallel/Year2/section5.html Interconnection Networks]&lt;br /&gt;
&lt;br /&gt;
=== 2-D Mesh ===&lt;br /&gt;
[[File:Mesh.png|thumb|c|right|upright=0.75|2D Mesh]]This has been a very popular topology due to its simple design and low layout and router complexity. It is often described as a k-ary n-cube, where k is the number of nodes in each dimension and n is the number of dimensions. For example, a 4-ary 2-cube is a 4x4 2D mesh.&lt;br /&gt;
Another advantage is that this topology is similar to the physical die layout, making it well suited to tiled architectures. For reference, the combination of a switch and a processor is called a ''tile''.&lt;br /&gt;
&lt;br /&gt;
This topology also has drawbacks. One of them is that the degree of the nodes along the edges is lower than the degree of the central nodes. This makes the 2D Mesh asymmetrical along the edges, so from the networking perspective there is less demand for edge channels than for central channels.&lt;br /&gt;
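&lt;br /&gt;
The asymmetry can be checked with a short sketch. It assumes a k-by-k mesh without wraparound links and counts a hop as one link traversal; the function names are chosen only for this illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def mesh_degree(x, y, k):
    """Number of neighbors of node (x, y) in a k-by-k 2D mesh (no wraparound)."""
    neighbors = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    # A neighbor exists only if both of its coordinates fall inside the mesh.
    return sum(1 for (nx, ny) in neighbors if nx in range(k) and ny in range(k))


def mesh_diameter(k):
    """Longest shortest path: corner to opposite corner, k - 1 links per dimension."""
    return 2 * (k - 1)
```
&lt;br /&gt;
For a 4x4 mesh this gives degree 2 at a corner, 3 on an edge, and 4 in the interior, with a diameter of 6 links.&lt;br /&gt;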
&lt;br /&gt;
Jerger and Peh [[#References|[2]]] provide the following information on parameters for a mesh defined as a k-ary n-cube:&lt;br /&gt;
*the switch degree for a 2D mesh is 4, as its network requires two channels in each dimension (2n), although some ports on the edge will be unused.&lt;br /&gt;
*average minimum hop count: &lt;br /&gt;
:{| {{table}}&lt;br /&gt;
| nk/3|| ||k even&lt;br /&gt;
|-&lt;br /&gt;
| n(k/3 - 1/(3k))|| ||k odd&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
*the channel load across the bisection of a mesh under uniform random traffic with an even k is k/4&lt;br /&gt;
*meshes provide diversity of paths for routing messages&lt;br /&gt;
&lt;br /&gt;
=== Concentration Mesh ===&lt;br /&gt;
[[File:Concentratedmesh.png|thumb|c|right|upright=0.75|Concentration Mesh]] This is an evolution of the mesh topology. There is no real need to have a 1:1 relationship between the number of cores and the number of switches/routers. The Concentration mesh reduces the ratio to 1:4, i.e. each router serves four computing nodes. &lt;br /&gt;
&lt;br /&gt;
The advantage over the simple mesh is the decrease in the average hop count. This is important in terms of scaling the solution. However, it is not as scalable as it might seem, because its degree is constrained by crossbar complexity.[[#References|[1]]]&lt;br /&gt;
&lt;br /&gt;
The reduction in the ratio results in a lower bisection channel count, but this can be mitigated by introducing express channels, as demonstrated in [[#References|[4]]].&lt;br /&gt;
&lt;br /&gt;
Another drawback is that the port bandwidth can become a bottleneck in periods of high traffic.&lt;br /&gt;
&lt;br /&gt;
=== Flattened Butterfly ===&lt;br /&gt;
[[File:Flbfly.png|thumb|c|right|upright=0.75|Flattened butterfly]]A butterfly topology is often described as a k-ary n-fly, which implies k&amp;lt;sup&amp;gt;n&amp;lt;/sup&amp;gt; network nodes with n stages of k&amp;lt;sup&amp;gt;n−1&amp;lt;/sup&amp;gt; k × k intermediate routing nodes. The degree of each intermediate router is 2k.  &lt;br /&gt;
&lt;br /&gt;
The flattened butterfly is made by flattening (i.e. combining) the routers in each row of a butterfly topology while preserving the inter-router connections. It uses non-minimal routing to improve load balancing in the network.&lt;br /&gt;
&lt;br /&gt;
Some advantages are that the maximum distance between nodes is two hops and it has lower latency and better throughput than that of the mesh topology.&lt;br /&gt;
&lt;br /&gt;
Its disadvantages are a high channel count (k&lt;sup&gt;2&lt;/sup&gt;/2 per row/column), low channel utilization, and increased control complexity.&lt;br /&gt;
&lt;br /&gt;
=== Multidrop Express Channels (MECS) ===&lt;br /&gt;
[[File:Mecs.png|thumb|c|right|upright=0.75|MECS]] Multidrop Express Channels was proposed in [[#References|[1]]] by Grot and Keckler. Their motivation was that performance and scalability should be obtained by managing wiring. &lt;br /&gt;
Multidrop Express Channels is defined by its authors as a &amp;quot;one-to-many communication fabric that enables a high degree of connectivity in a bandwidth-efficient manner.&amp;quot;  It is based on point-to-point unidirectional links. This makes for a high degree of connectivity with fewer bisection channels and higher bandwidth for each channel. &lt;br /&gt;
&lt;br /&gt;
Some of the parameters calculated for MECS are:&lt;br /&gt;
*Bisection channel count per row/column is equal to k.&lt;br /&gt;
*Network diameter (maximum hop count) is two.&lt;br /&gt;
*The number of nodes accessible through each channel ranges from 1 to k − 1.&lt;br /&gt;
*A node has 1 output port per direction&lt;br /&gt;
*The input port count is 2(k − 1)&lt;br /&gt;
&lt;br /&gt;
The low channel count and the high degree of connectivity provided by each channel increase per channel bandwidth and wire utilization. At the same time, the design minimizes the serialization delay. It presents low network latencies due to its low diameter.&lt;br /&gt;
&lt;br /&gt;
=== Comparison of topologies ===&lt;br /&gt;
This data is taken from the analysis done in [[#References|[1]]]. &lt;br /&gt;
&lt;br /&gt;
[[File:Topologycomp.png|thumbnail|center|upright=5|Comparison of CMesh, Flattened Butterfly, and MECS]]&lt;br /&gt;
&lt;br /&gt;
The information in this table compares three of the topologies described above for two combinations of k, the network radix (nodes per dimension), and c, the concentration factor (1 meaning no concentration). &lt;br /&gt;
&lt;br /&gt;
The maximum hop count is 2 for the flattened butterfly and MECS, whereas it is directly proportional to k for the Concentrated Mesh, which makes the flattened butterfly and MECS better solutions with lower network latency.&lt;br /&gt;
&lt;br /&gt;
The bisection channel count is 1 for CMesh in both cases, but it is doubled, and even quadrupled, for MECS and the flattened butterfly. &lt;br /&gt;
&lt;br /&gt;
The bandwidth per channel in this example is better for CMesh and MECS, and lower for the flattened butterfly.&lt;br /&gt;
&lt;br /&gt;
=== Examples of topologies in current NoCs ===&lt;br /&gt;
&lt;br /&gt;
==== Intel ====&lt;br /&gt;
The [http://techresearch.intel.com/ProjectDetails.aspx?Id=151 Intel Teraflops Research Chip] is built around an 8x10 mesh with two 38-bit unidirectional links per channel. It has a bisection bandwidth of 380 GB/s, which includes data and sideband communication. There is a 5-port router inside each of the computing nodes, and communication is carried out through message passing. Its name comes from its performance of one trillion mathematical calculations per second (1 teraflops), achieved with 80 simple cores, each containing 2 floating-point units, while consuming only 62 watts (less than many other processors).&lt;br /&gt;
 &lt;br /&gt;
The [http://techresearch.intel.com/ProjectDetails.aspx?Id=1 Single-Chip Cloud Computer] contains a 24-router mesh network with 256 GB/s bisection bandwidth. This design contains 48 fully functional cores and consumes only 25 watts. This newer model is more complete than the Teraflops Research model. It is fully programmable and is used for research by academia and private companies.&lt;br /&gt;
&lt;br /&gt;
==== Tilera ====&lt;br /&gt;
&lt;br /&gt;
Tilera, maker of the [http://www.tilera.com/products/processors TileGx, TilePro, and Tile64] processors, is a fabless semiconductor company that has developed a &amp;quot;tile processor&amp;quot; whereby the fabrication of the multi-processor device is greatly simplified by the placement of processor &amp;quot;tiles&amp;quot; on the die. The technology behind this innovation is iMesh, which is the name of the on-chip interconnection technology used in the Tile Processor's architecture&amp;lt;ref name=&amp;quot;Tilera&amp;quot;&amp;gt;&amp;quot;On-Chip Interconnection Architecture of the Tile Processor,&amp;quot; Wentzlaff, et al. 2007. IEEE Xplore.&amp;lt;/ref&amp;gt;.  The iMesh™ consists of five 8x8 independent mesh networks with two 32-bit unidirectional links per channel. The Tile Processor is innovative due to its highly scalable implementation of an on-chip network that utilizes 2D meshes. These are physically organized (as opposed to logically organized) due to design considerations when scaling and laying out new designs. It provides a bisection bandwidth of 320 GB/s.&lt;br /&gt;
The tiles that make up the Tilera designs each contain a complete processor with L1 and L2 caches. Each tile can run an operating system independently, or several tiles together can run a single operating system such as SMP Linux.&lt;br /&gt;
&lt;br /&gt;
==== ST Microelectronics ====&lt;br /&gt;
[[File:Spidergon.png|thumb|c|right|upright=1.5|Example of Spidergon design]]&lt;br /&gt;
ST Microelectronics created the Spidergon design for the STNoC [[#References|[5]]]. &lt;br /&gt;
&lt;br /&gt;
The Spidergon is a pseudo-regular topology with a design that is composed of three building blocks: network interface, router, and physical link. These building blocks make the design ready to be tailored to the needs of the application. Each router building block has a degree of 3.&lt;br /&gt;
&lt;br /&gt;
The 3 building blocks can be used to create the specific design needed, with the input/output ports that the application requires. The blocks can be configured and stored in a library for creating the design. In the picture on the right, the example contains 2 of the building blocks (router and network interface) and a third undisclosed block.&lt;br /&gt;
&lt;br /&gt;
==== IBM ====&lt;br /&gt;
The IBM [http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.html Cell] project uses an interconnect with four unidirectional 16B-wide data rings, two in each direction. This interconnect is called the Element Interconnect Bus (EIB); it allows the different components of the Cell to communicate with each other and with external I/O. The total network bisection bandwidth is 307.2 GB/s. &lt;br /&gt;
&lt;br /&gt;
As a curiosity, the Cell processor was jointly developed with Sony and Toshiba, and is [http://en.wikipedia.org/wiki/Cell_(microprocessor) used] in the [http://news.cnet.com/PlayStation-3-chip-has-split-personality/2100-1043_3-5566340.html?tag=nl Sony PlayStation 3]. The Cell consists of a PowerPC core which manages eight Synergistic Processing Elements (SPEs) that can be used for floating-point calculations. These calculations provide the engine for better gaming systems.&lt;br /&gt;
&lt;br /&gt;
== Routing ==&lt;br /&gt;
&lt;br /&gt;
There are a variety of routing protocols that can be used for [http://en.wikipedia.org/wiki/System_on_a_chip SoCs], each having different advantages and disadvantages.  They can be broadly classified in several different ways.&lt;br /&gt;
&lt;br /&gt;
===General Routing Schemes===&lt;br /&gt;
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Store_and_forward Store and forward routing]==== &lt;br /&gt;
This routing scheme has been used since the early days of telecommunications.  It requires that the entire message be received at a node before it is propagated to the next node.  This protocol suffers from a high storage requirement and high latency, due to the need to completely buffer a message before forwarding it.[[#References|[7]]]  This approach can be quite effective when the average packet size is small in comparison with the channel widths.&lt;br /&gt;
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Cut-through_switching Cut-Through routing] or [http://en.wikipedia.org/wiki/Wormhole_switching Worm Hole routing]====&lt;br /&gt;
These two protocols use the switch to examine the header flit, decide where to send the message, and then start forwarding it immediately.  True cut-through routing lets the tail continue when the head is blocked, collecting the whole packet in a single switch (which requires a buffer large enough to hold the largest packet).  In wormhole routing, when the head of the message is blocked, the message stays strung out over multiple nodes in the network, potentially blocking other messages (however, this needs only enough buffer space to store the piece of the packet that is sent between switches).  Using a cut-through protocol lowers latency, but because forwarding starts before the whole packet has been received and checked, a corrupted packet can be propagated, so the network must implement a scheme to handle this.[[#References|[7]]]&lt;br /&gt;
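&lt;br /&gt;
The latency difference between store and forward and the cut-through family can be illustrated with the usual zero-load approximation. The sketch below is not taken from the cited papers: the function names, the per-hop router delay, and the example numbers are assumptions chosen for illustration, and contention is ignored.&lt;br /&gt;
&lt;br /&gt;
```python
def store_and_forward_latency(hops, packet_bits, bits_per_cycle, router_delay):
    """Zero-load latency in cycles when each node buffers the whole packet."""
    # Cycles needed to push the whole packet across one channel.
    serialization = packet_bits / bits_per_cycle
    # The full serialization time is paid again at every hop.
    return hops * (router_delay + serialization)


def cut_through_latency(hops, packet_bits, bits_per_cycle, router_delay):
    """Zero-load latency in cycles when the header is forwarded immediately."""
    serialization = packet_bits / bits_per_cycle
    # The body pipelines behind the header, so serialization is paid once.
    return hops * router_delay + serialization


if __name__ == "__main__":
    # Assumed example: 4 hops, 512-bit packet, 128-bit channel, 2-cycle routers.
    print(store_and_forward_latency(4, 512, 128, 2))
    print(cut_through_latency(4, 512, 128, 2))
```
&lt;br /&gt;
Under these assumed numbers, store and forward pays the serialization time at every hop, whereas cut-through and wormhole routing pay it only once.&lt;br /&gt;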
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Deterministic_routing Deterministic routing]====&lt;br /&gt;
This describes a routing scheme where, if we are given a pair of nodes, the same path will always be used between those nodes.&lt;br /&gt;
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Adaptive_routing Adaptive routing]====&lt;br /&gt;
This is a routing scheme where the underlying routers may alter the path of packet flow in response to system conditions or other algorithm criteria.  Adaptive routing is intended to provide as many routes as possible to reach the destination.&lt;br /&gt;
&lt;br /&gt;
====Deadlock and Livelock====&lt;br /&gt;
&lt;br /&gt;
Deadlock and livelock are two separate situations that may occur during routing, both resulting in packets never reaching their destination.  They are defined as follows:&lt;br /&gt;
&lt;br /&gt;
''' Deadlock ''' is defined as a situation where there are activities (e.g., messages) each waiting for another to finish something.[[#References|[8]]] Since a waiting activity cannot finish, the messages are deadlocked.  This is analogous to the [http://en.wikipedia.org/wiki/Dining_philosophers_problem Dining Philosophers Problem]: each deadlocked message is waiting on the result of another deadlocked message, and none is able to reach its destination.&lt;br /&gt;
&lt;br /&gt;
''' Livelock ''' is defined as a situation where a message can move from node to node but will never reach its destination node.[[#References|[8]]]  This is similar to deadlock in that the message never reaches its destination, but the message is still able to travel through portions of the network, making hops but never reaching its target.  This is analogous to a process spinning while waiting: the process is still active, but it is doing no useful work.  &lt;br /&gt;
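&lt;br /&gt;
One way to make the deadlock definition concrete, assumed here rather than taken from [8], is to record which message waits on which and look for a cycle. In this minimal sketch a message needs every message in its list to finish first, so any cycle means that none of the messages on it can ever proceed. The message names and the dictionary layout are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def has_deadlock(waits_for):
    """Return True if the wait-for graph contains a cycle.

    waits_for maps each message to the messages it is waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {message: WHITE for message in waits_for}

    def visit(message):
        colour[message] = GREY
        for other in waits_for.get(message, ()):
            state = colour.get(other, WHITE)
            if state == GREY:
                return True  # reached a message already on the current path
            if state == WHITE and visit(other):
                return True
        colour[message] = BLACK
        return False

    return any(colour[m] == WHITE and visit(m) for m in list(waits_for))


if __name__ == "__main__":
    # Hypothetical messages A, B, C waiting on each other in a ring.
    print(has_deadlock({"A": ["B"], "B": ["C"], "C": ["A"]}))
    # The same messages with C free to finish.
    print(has_deadlock({"A": ["B"], "B": ["C"], "C": []}))
```
&lt;br /&gt;
Livelock does not show up in such a graph, because a livelocked message is never waiting; it keeps moving without making progress.&lt;br /&gt;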
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
===Routing Protocols in SoCs===&lt;br /&gt;
&lt;br /&gt;
The specific routing protocols below are built using the ideas from the classes of protocols previously described.&lt;br /&gt;
&lt;br /&gt;
==== Source Routing ====&lt;br /&gt;
&lt;br /&gt;
The source node partially or totally computes the path a packet will take through the network and stores the information in the packet header.  Because this route information is carried in every packet, it inflates packet size.&lt;br /&gt;
&lt;br /&gt;
==== Distributed Routing ====&lt;br /&gt;
&lt;br /&gt;
Each switch in the network computes the next hop that the packet will take towards the destination.  The packet header contains only the destination information, reducing its size compared to source routing.  This approach requires routing tables to be present to direct the packet from node to node, which does not scale well when the number of nodes increases.&lt;br /&gt;
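&lt;br /&gt;
The contrast between source routing and distributed routing can be sketched on a toy network. Everything in the example below is hypothetical: a three-node line A - B - C, port names ''east'' and ''west'', and per-node routing tables written by hand. It is meant only to show where the routing decision lives in each scheme.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical 3-node line network A - B - C; port names are illustrative.
NEIGHBOUR = {
    ("A", "east"): "B",
    ("B", "west"): "A",
    ("B", "east"): "C",
    ("C", "west"): "B",
}

# Distributed routing: one table per switch, indexed by destination.
ROUTING_TABLES = {
    "A": {"B": "east", "C": "east"},
    "B": {"A": "west", "C": "east"},
    "C": {"A": "west", "B": "west"},
}


def deliver_source_routed(source, ports):
    """Follow the list of output ports that the source stored in the header."""
    node = source
    for port in ports:  # each switch only reads the next port from the header
        node = NEIGHBOUR[(node, port)]
    return node


def deliver_distributed(source, destination):
    """Each switch looks up the output port for the destination in its own table."""
    node, hops = source, 0
    while node != destination:
        port = ROUTING_TABLES[node][destination]
        node = NEIGHBOUR[(node, port)]
        hops += 1
    return node, hops


if __name__ == "__main__":
    print(deliver_source_routed("A", ["east", "east"]))  # header carries the whole path
    print(deliver_distributed("A", "C"))                 # header carries only "C"
```
&lt;br /&gt;
In the source-routed call the header grows with the path length, while in the distributed call the header stays fixed but every switch has to hold a table entry per destination.&lt;br /&gt;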
&lt;br /&gt;
==== Logic Based Distributed Routing (LBDR) ====&lt;br /&gt;
&lt;br /&gt;
In this protocol, routing is achieved by each router knowing its position in the architecture and being able to determine in which direction the destination of the packet lies.  It is most commonly used in 2D meshes, but it can be applied to other topologies as well.[[#References|[7]]]  Using this position information, it is possible to route the packet based on a small number of bits and a few logic gates per router, which costs less than a routing table or a buffer.&lt;br /&gt;
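&lt;br /&gt;
The position comparison that this description relies on can be sketched as follows. This is only the coordinate-comparison step, under the assumption that nodes are addressed as (x, y) pairs with x growing to the east and y growing to the north; the per-router configuration bits that the full LBDR mechanism combines with these signals are deliberately left out, and the function name is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def relative_directions(current, destination):
    """Return the set of directions in which the destination lies.

    Assumes (x, y) coordinates with x growing east and y growing north.
    """
    (cur_x, cur_y), (dst_x, dst_y) = current, destination
    directions = set()
    if dst_x > cur_x:
        directions.add("E")
    if cur_x > dst_x:
        directions.add("W")
    if dst_y > cur_y:
        directions.add("N")
    if cur_y > dst_y:
        directions.add("S")
    return directions  # empty set: the packet has arrived


if __name__ == "__main__":
    print(sorted(relative_directions((1, 1), (3, 0))))
```
&lt;br /&gt;
Each test is a simple comparator, which is why this kind of routing can be built from a few gates instead of a table.&lt;br /&gt;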
&lt;br /&gt;
&lt;br /&gt;
There are several variations of LBDR:&lt;br /&gt;
&lt;br /&gt;
''' LBDRe ''' - This variation models up to two future hops before deciding where to send the packet next.    &lt;br /&gt;
&lt;br /&gt;
''' uLBDR (Universal LBDR) ''' - This variation adds packet multicast support to the protocol.&lt;br /&gt;
&lt;br /&gt;
''' bLBDR ''' - This variation adds the ability to broadcast messages to only certain regions (segments) of the network.&lt;br /&gt;
&lt;br /&gt;
==== Bufferless Deflection Routing (BLESS protocol) ====&lt;br /&gt;
&lt;br /&gt;
In this protocol, each flit of a packet is routed independently of every other flit through the network, and different flits from the same packet may take different paths.  Any contention between multiple flits results in one flit taking the desired path and the other flit being “deflected” to some other router.  This may result in undesirable routing, but the packets will eventually reach the destination.[[#References|[10]]]  This type of routing is feasible on every network topology that satisfies the following two constraints: Every router has at least the same number of output ports as the number of its input ports, and every router is reachable from every other router.[[#References|[10]]]  &lt;br /&gt;
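&lt;br /&gt;
A minimal sketch of deflection-style port assignment is shown below. It assumes that the oldest flit is served first, that flits are given as (age, desired port) pairs, and that a losing flit simply takes the first free port; these choices and all names are illustrative rather than a description of the router in [10]. The sketch relies on the constraint stated above that there are at least as many output ports as incoming flits.&lt;br /&gt;
&lt;br /&gt;
```python
def assign_ports(flits, ports):
    """Give every flit an output port, deflecting the losers of a contention.

    flits: list of (age, desired_port) pairs; ports: available output ports.
    Returns the granted port for each flit, in input order.
    """
    free = list(ports)
    granted = {}
    # Assumed priority rule: serve the oldest flit first.
    order = sorted(range(len(flits)), key=lambda i: -flits[i][0])
    for i in order:
        wanted = flits[i][1]
        # A flit that loses its desired port is deflected to any free port.
        port = wanted if wanted in free else free[0]
        free.remove(port)
        granted[i] = port
    return [granted[i] for i in range(len(flits))]


if __name__ == "__main__":
    # Two flits want the east port; the older one (age 9) gets it.
    print(assign_ports([(5, "E"), (9, "E"), (1, "N")], ["N", "E", "S", "W"]))
```
&lt;br /&gt;
No flit is ever buffered or dropped in this scheme: every incoming flit leaves on some port in the same step, which is what makes the router bufferless.&lt;br /&gt;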
&lt;br /&gt;
==== CHIPPER (Cheap-Interconnect Partially Permuting Router) ====&lt;br /&gt;
&lt;br /&gt;
This protocol was designed to address inefficient port allocation in the BLESS protocol.  A permutation network directs deflected flits to free output ports.  In the case of contention, arbitration logic chooses a winning flit.  Livelock is prevented by relaxing the requirement so that only the highest-priority flit is guaranteed to obtain its requested port.  To do this, the scheme chooses a single packet and prioritizes that packet globally above all other packets for long enough that its delivery is ensured.  Every packet in the system eventually receives this special status, so every packet is eventually delivered (the Golden Packet scheme).[[#References|[11]]]&lt;br /&gt;
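&lt;br /&gt;
The idea that every packet eventually becomes the globally prioritized one can be sketched with a simple rotation. The example assumes that golden status moves round-robin over a fixed list of packet identifiers once per epoch; the epoch length, the identifiers, and the function names are hypothetical and are not taken from [11].&lt;br /&gt;
&lt;br /&gt;
```python
def golden_packet_id(cycle, epoch_length, packet_ids):
    """Return the packet id that holds golden status during the given cycle."""
    epoch = cycle // epoch_length
    # Assumed policy: rotate through all ids, one id per epoch.
    return packet_ids[epoch % len(packet_ids)]


def arbitrate(flit_a, flit_b, golden_id):
    """Pick the winner of a two-flit contention; each flit is (packet_id, payload)."""
    if flit_b[0] == golden_id and flit_a[0] != golden_id:
        return flit_b
    # Either flit_a is golden, or neither is and the choice is arbitrary.
    return flit_a


if __name__ == "__main__":
    ids = [7, 8, 9]
    for cycle in (0, 10, 20, 30):
        print(cycle, golden_packet_id(cycle, 10, ids))
```
&lt;br /&gt;
Because the rotation visits every identifier, each packet is guaranteed a period in which it wins every contention, provided the epoch is long enough for a prioritized packet to cross the network.&lt;br /&gt;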
&lt;br /&gt;
==== Dimension-order Routing ====&lt;br /&gt;
&lt;br /&gt;
This protocol is a deterministic strategy for multidimensional networks.  The dimensions are taken in a fixed order, and a packet is routed completely in one dimension before switching to the next.  For example, in a 2D mesh, dimension-order routing could be implemented by completely routing the packet in the X-dimension before beginning to route in the Y-dimension.  This extends to higher-dimensional networks as well.  For example, hypercubes can be routed in dimension order by correcting the bit positions in which the source and destination addresses differ, one bit position at a time.[[#References|[9]]]&lt;br /&gt;
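&lt;br /&gt;
Both examples in the paragraph above can be written down in a few lines. The sketch assumes mesh nodes addressed as (x, y) pairs and hypercube nodes addressed as integers whose bits are the coordinates; correcting bits from the lowest position upward is one possible fixed order, and the function names are illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
def xy_route(source, destination):
    """Return the nodes visited in a 2D mesh, X dimension first, then Y."""
    x, y = source
    dest_x, dest_y = destination
    path = [(x, y)]
    while x != dest_x:  # finish the X dimension completely
        x += 1 if dest_x > x else -1
        path.append((x, y))
    while y != dest_y:  # only then start on the Y dimension
        y += 1 if dest_y > y else -1
        path.append((x, y))
    return path


def hypercube_route(source, destination, dimensions):
    """Correct differing address bits one at a time, lowest bit position first."""
    node = source
    path = [node]
    for bit in range(dimensions):
        mask = 2 ** bit
        if (node ^ destination) // mask % 2 == 1:  # addresses differ in this bit
            node ^= mask
            path.append(node)
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(hypercube_route(0b000, 0b101, 3))
```
&lt;br /&gt;
Both functions are deterministic in the sense defined earlier: a given pair of nodes always yields the same path.&lt;br /&gt;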
&lt;br /&gt;
== Lines of Research ==&lt;br /&gt;
Beyond the many technologies found in commercial designs, NoCs are the subject of several lines of research. Some of them are presented in this section.&lt;br /&gt;
&lt;br /&gt;
=== Optical on-chip interconnects ===&lt;br /&gt;
IBM has been performing extensive research on a photonic layer inside the CMP that is used not only for connecting several cores, but also for routing traffic: [http://researcher.ibm.com/view_project.php?id=2757 Silicon Integrated Nanophotonics.] This technology was actually used in the IBM Cell chip that was described in the sections above. The main advantages are reliability and power efficiency.&lt;br /&gt;
&lt;br /&gt;
This [http://www.research.ibm.com/photonics/publications/ecoc_tutorial_2008.pdf tutorial] explains some differences between electronics and photonics in terms of power consumption. The more power-efficient the computation, the more FLOPS per watt it delivers:&lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Electronics'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Photonics'''&lt;br /&gt;
|-&lt;br /&gt;
| Electronic network ~500W||Optic network &amp;lt;80W&lt;br /&gt;
|-&lt;br /&gt;
| power = bandwidth x length||power does not depend on bitrate nor length&lt;br /&gt;
|-&lt;br /&gt;
| buffer on chip that rx and re-tx every bit at every switch||rx (modulate) data once, without having to re-tx&lt;br /&gt;
|-&lt;br /&gt;
| ||switching fabric has almost no power dissipation&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In academia, there are articles such as [[#References|[6]]], which proposes a new topology created for optical on-chip interconnects. The authors refer to previous papers that adapt well-known electronic designs, but highlight the need to provide a &amp;quot;scalable all-optical NoC, referred to as 2D-HERT, with passive routing of optical data streams based on their wavelengths.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Reconfigurable NoC ===&lt;br /&gt;
Another field of study is software-reconfigurable on-chip networks. They are commonly based on the 2D mesh topology. The main idea is to be able to reconfigure the NoC for each application and at run time, in order to react to congestion or, more generally, adapt to the traffic load. &lt;br /&gt;
&lt;br /&gt;
In [[#References|[12]]], the authors propose a design based on the properties of the [http://en.wikipedia.org/wiki/Field-programmable_gate_array field-programmable gate array (FPGA)]. It can dynamically implement circuit-switching channels, vary the topology, and reconfigure routing tables. One of the main drawbacks is the overhead that this reconfiguration introduces, although the design aims to minimize it.&lt;br /&gt;
&lt;br /&gt;
=== Bio NoC ===&lt;br /&gt;
Bio NoC or ANoC (Autonomic Network-on-Chip) is based on the concept of the human autonomic nervous system or the human biological immune system. The intention is to provide a NoC with self-organization, self-configuration, and self-healing to dynamically control networking functions. &lt;br /&gt;
&lt;br /&gt;
[[#References|[13]]] presents a collection of chapters/articles from emerging research issues in the ANoC field of application.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
 &lt;br /&gt;
[1] Mirza-Aghatabar, M.; Koohi, S.; Hessabi, S.; Pedram, M.; [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4341445 &amp;quot;An Empirical Investigation of Mesh and Torus NoC Topologies Under Different Routing Algorithms and Traffic Models,&amp;quot;] 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007), pp. 19-26, 29-31 Aug. 2007&lt;br /&gt;
&lt;br /&gt;
[2] Ying Ping Zhang; Taikyeong Jeong; Fei Chen; Haiping Wu; Nitzsche, R.; Gao, G.R.; [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1639301 &amp;quot;A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture,&amp;quot;] 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 10 pp., 25-29 April 2006&lt;br /&gt;
&lt;br /&gt;
[3] David Wentzlaff, Patrick Griffin, Henry Hoffmann, Liewei Bao, Bruce Edwards, Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John F. Brown III, and Anant Agarwal. 2007. [http://www.eecg.toronto.edu/~enright/tilera.pdf On-Chip Interconnection Architecture of the Tile Processor.] IEEE Micro 27, 5 (September 2007), 15-31.&lt;br /&gt;
&lt;br /&gt;
[4] D. N. Jayasimha, B. Zafar, Y. Hoskote. [http://blogs.intel.com/wp-content/mt-content/com/research/terascale/ODI_why-different.pdf On-chip interconnection networks: why they are different and how to compare them.] Technical Report, Intel Corp, 2006&lt;br /&gt;
&lt;br /&gt;
[5] John Kim, James Balfour, and William Dally. [http://cva.stanford.edu/publications/2007/MICRO_FBFLY.pdf Flattened butterfly topology for on-chip networks.] In Proceedings of the 40th International Symposium on Microarchitecture, pages 172–182, December 2007.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
[1] B. Grot and S. W. Keckler. [http://www.cs.utexas.edu/~bgrot/docs/CMP-MSI_08.pdf Scalable on-chip interconnect topologies.] 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects, 2008.&lt;br /&gt;
&lt;br /&gt;
[2] Natalie Enright Jerger and Li-Shiuan Peh. [http://www.morganclaypool.com/doi/abs/10.2200/S00209ED1V01Y200907CAC008?journalCode=cac On-Chip Networks.] Synthesis Lectures on Computer Architecture. 2009, 141 pages. Morgan and Claypool Publishers.&lt;br /&gt;
&lt;br /&gt;
[3] Yan Solihin. (2008). [http://www.cesr.ncsu.edu/solihin/Main.html Fundamentals of parallel computer architecture.] Solihin Pub.&lt;br /&gt;
&lt;br /&gt;
[4] James Balfour and William J. Dally. 2006. [http://www.cs.berkeley.edu.prox.lib.ncsu.edu/~kubitron/courses/cs258-S08/handouts/papers/jbalfour_ICS.pdf Design tradeoffs for tiled CMP on-chip networks.] In Proceedings of the 20th annual international conference on Supercomputing (ICS '06). ACM, New York, NY, USA, 187-198.&lt;br /&gt;
&lt;br /&gt;
[5] Dubois, F.; Cano, J.; Coppola, M.; Flich, J.; Petrot, F.; [http://www.comcas.eu/publications/Spidergon_STNoC_Design.pdf Spidergon STNoC design flow,] Fifth IEEE/ACM International Symposium on Networks on Chip (NoCS 2011), pp. 267-268, 1-4 May 2011&lt;br /&gt;
&lt;br /&gt;
[6] Koohi, S.; Abdollahi, M.; Hessabi, S.; [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=5948588&amp;amp;isnumber=5948548 All-optical wavelength-routed NoC based on a novel hierarchical topology,] Fifth IEEE/ACM International Symposium on Networks on Chip (NoCS 2011), pp. 97-104, 1-4 May 2011&lt;br /&gt;
&lt;br /&gt;
[7] Flich, J.; Duato, J.; [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=4407676&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4407676 Logic-Based Distributed Routing for NoCs,] Computer Architecture Letters, vol. 7, no. 1, pp. 13-16, Jan 2008&lt;br /&gt;
&lt;br /&gt;
[8] Wu, J.; [http://www.cse.fau.edu/~jie/research/publications/Publication_files/ieeetc0309.pdf A Fault-Tolerant and Deadlock-Free Routing Protocol in 2D Meshes Based on Odd-Even Turn Model,] 2003 IEEE Transactions on Computers, Vol. 52, No. 9, pp.1154-1169, Sept 2003&lt;br /&gt;
&lt;br /&gt;
[9] Veselovsky, G.; Batovski, D.A.; [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=1183584&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1183584 A study of the permutation capability of a binary hypercube under deterministic dimension-order routing,] Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing, pp. 173-177, 5-7 Feb. 2003&lt;br /&gt;
&lt;br /&gt;
[10] Moscibroda, T; Mutlu, O.; [http://research.microsoft.com/pubs/80241/isca_2009-bless.pdf A Case for Bufferless Routing in On-Chip Networks,] ACM SIGARCH Computer Architecture News, Volume 37 Issue 3, June 2009&lt;br /&gt;
&lt;br /&gt;
[11] Fallin, C.; Craik, C.; Mutlu, O.; [http://www.ece.cmu.edu/~safari/pubs/chipper_hpca2011.pdf CHIPPER: A Low-complexity Bufferless Deflection Router,] Proceedings of the 17th IEEE International Symposium on High Performance Computer Architecture (HPCA 2011), San Antonio, TX, February 2011.&lt;br /&gt;
&lt;br /&gt;
[12] V. Rana, et al., [http://infoscience.epfl.ch/record/130661/files/paperM2B-VLSI-SoC2008%5b1%5d.pdf A Reconfigurable Network-on-Chip Architecture for Optimal Multi-Processor SoC Communication,] in VLSI-SoC, 2009.&lt;br /&gt;
&lt;br /&gt;
[13] Cong-Vinh, P. (December 2011). [http://www.crcpress.com/product/isbn/9781439829110 Autonomic networking-on-chip: Bio-inspired specification, development, and verification.] CRC Press.&lt;br /&gt;
&lt;br /&gt;
[14] S. Kumar, et al., [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=1016885&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1016885 A Network on Chip Architecture and Design Methodology,] IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2002.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
1. Advantage of 2-D Mesh&lt;br /&gt;
&lt;br /&gt;
a) simple design&lt;br /&gt;
&lt;br /&gt;
b) cumbersome design&lt;br /&gt;
&lt;br /&gt;
c) degree is the same for all nodes&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
2. Diameter is&lt;br /&gt;
&lt;br /&gt;
a) minimum hop count&lt;br /&gt;
&lt;br /&gt;
b) maximum hop count&lt;br /&gt;
&lt;br /&gt;
c) number of neighbors &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
3. SOC stands for&lt;br /&gt;
&lt;br /&gt;
a) System of Chips&lt;br /&gt;
&lt;br /&gt;
b) Switch of Cores&lt;br /&gt;
&lt;br /&gt;
c) System on a Chip&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4. In a direct topology,&lt;br /&gt;
&lt;br /&gt;
a) each node contains a network interface acting as a router in order to transfer information&lt;br /&gt;
&lt;br /&gt;
b) there are nodes that act as routers&lt;br /&gt;
&lt;br /&gt;
c) only one node is a computational node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5. The Single-Chip Cloud Computer contains &lt;br /&gt;
&lt;br /&gt;
a) an 8x10 mesh&lt;br /&gt;
&lt;br /&gt;
b) a 64-router mesh network&lt;br /&gt;
&lt;br /&gt;
c) a 24-router mesh network&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
6. A deterministic routing scheme uses algorithms to determine the most advantageous path to the target node.&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
7. Livelock is necessary to maintain coherence in routing protocols.&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
8. Dimension Order routing&lt;br /&gt;
&lt;br /&gt;
a) is only possible with 2D mesh-based topologies.&lt;br /&gt;
&lt;br /&gt;
b) attempts to route all packets in one dimension before starting another.&lt;br /&gt;
&lt;br /&gt;
c) uses routing tables to find the packet destination.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
9. Source routing&lt;br /&gt;
&lt;br /&gt;
a) includes information in the packet about the destination node&lt;br /&gt;
&lt;br /&gt;
b) uses routing information calculated by the sending node&lt;br /&gt;
&lt;br /&gt;
c) all of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
10. Store and forward routing&lt;br /&gt;
&lt;br /&gt;
a) requires the entire message to be broken into regular sized pieces and sent over the network&lt;br /&gt;
&lt;br /&gt;
b) is an optimal routing protocol&lt;br /&gt;
&lt;br /&gt;
c) buffers the entire message in each node along the route before sending it to the next node&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/12b_sl&amp;diff=74916</id>
		<title>CSC/ECE 506 Spring 2013/12b sl</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/12b_sl&amp;diff=74916"/>
		<updated>2013-04-17T18:06:37Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: Created page with &amp;quot;On-chip interconnects __TOC__   == Introduction ==  The current trend in microprocessor design has shifted from extracting ever increasing performance gains from single core arch...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;On-chip interconnects&lt;br /&gt;
__TOC__ &lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
The current trend in microprocessor design has shifted from extracting ever increasing performance gains from single core architecture to leveraging the power of multiple cores per die.  This creates new challenges not present in single core systems.  A multi core processor must have a method of passing information between processing cores that is efficient in terms of power consumed, space used on die, and the speed at which messages are delivered.  As physical wire widths are decreased and the number of wires is increased, the difference between gate delay and wire delay is exacerbated.[[#References|[14]]]  To combat these challenges, much research has been done in the area of on-chip networks.&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
On-chip interconnects are a natural extension of the high integration levels that nowadays are reached with multiprocessor integration. Moore's law predicted that the number of transistors in an integrated circuit doubles every two years. This assumption has driven the integration of on-chip components and continues to show the way in the semiconductor industry.&lt;br /&gt;
[[File:Itr MIC image 920x460.png|thumb|c|right|Intel® MIC]]&lt;br /&gt;
In recent years, the main players in the chip industry keep racing to provide more cores integrated in a chip, with the multi-core (more than one core) and many-core (multi-core with so many cores that the historical multi-core techniques are not efficient any longer) technologies. This integration is known as [http://en.wikipedia.org/wiki/Multi-core_(computing) CMP] (chip multiprocessor) and lately Intel has coined the term [http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html Intel® Many Integrated Core (Intel® MIC)].&lt;br /&gt;
&lt;br /&gt;
The traditional off-chip network has proved to have limited application for communication among so many cores inside a single chip. According to [[#References|[2]]], off-chip designs suffer from I/O bottlenecks, which are less of a problem for on-chip technologies because the internal wiring provides much higher bandwidth and avoids the delay associated with external traffic. Nevertheless, on-chip designs still face challenges that need further study, among them power consumption and space constraints.&lt;br /&gt;
&lt;br /&gt;
=== Terminology ===&lt;br /&gt;
Some common terms:&lt;br /&gt;
* [http://en.wikipedia.org/wiki/System_on_a_chip SoCs] (Systems-on-a-chip), which commonly refer to chips that are made for a specific application or domain area.&lt;br /&gt;
* [http://en.wikipedia.org/wiki/MPSoC MPSoCs] (Multiprocessor systems-on-chip), referring to a SoC that uses multi-core technology.&lt;br /&gt;
It is interesting to note that for the particular theme of this article, there are at least three different acronyms referring to the same term. These are new technologies and different researchers have adopted different nomenclature. The acronyms are:&lt;br /&gt;
* NoC (network-on-chip), this is the most common term and also used in this article&lt;br /&gt;
* OCIN (on-chip interconnection network) &lt;br /&gt;
* OCN (on-chip network)&lt;br /&gt;
&lt;br /&gt;
== Topologies ==&lt;br /&gt;
Topology refers to the layout or arrangement of interconnections among the processing elements. In general, a good topology aims to minimize network latency and maximize throughput.&lt;br /&gt;
There are certain metrics that help with the classification and comparison of the different topology types. Some of them are defined in Solihin's [[#References|[3]]] textbook in chapter 12.&lt;br /&gt;
&lt;br /&gt;
*'''Degree''' is defined as the number of nodes that are neighbors of a given node or, in other words, that can be reached from it in one hop&lt;br /&gt;
*'''Hop count''' is the number of nodes that a message needs to pass through to get to the destination&lt;br /&gt;
*'''Diameter''' is the maximum hop count&lt;br /&gt;
*'''Path diversity''' is useful for the routing algorithm and is given by the number of shortest paths that a topology offers between two nodes.&lt;br /&gt;
*'''Bisection width''' is the smallest number of wires you have to cut to separate the network into two halves&lt;br /&gt;
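&lt;br /&gt;
The degree and diameter definitions can be checked on a small example. The sketch below assumes a k x k 2D mesh with nodes addressed as (x, y) pairs and counts one hop per link traversed; the function names are illustrative and the breadth-first search is simply one convenient way to obtain minimum hop counts.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque


def mesh_neighbours(node, k):
    """Neighbours of a node in a k x k mesh (no wrap-around links)."""
    x, y = node
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(a, b) for a, b in candidates if a in range(k) and b in range(k)]


def hop_counts(start, k):
    """Minimum hop count from start to every node, by breadth-first search."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in mesh_neighbours(node, k):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def diameter(k):
    """Largest minimum hop count over all pairs of nodes."""
    nodes = [(x, y) for x in range(k) for y in range(k)]
    return max(max(hop_counts(n, k).values()) for n in nodes)


if __name__ == "__main__":
    print(len(mesh_neighbours((0, 0), 4)))  # degree of a corner node
    print(len(mesh_neighbours((1, 1), 4)))  # degree of an interior node
    print(diameter(4))
```
&lt;br /&gt;
The corner and interior nodes report different degrees, and the diameter is reached between opposite corners of the mesh.&lt;br /&gt;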
&lt;br /&gt;
&lt;br /&gt;
Topologies can be classified as direct and indirect topologies.&lt;br /&gt;
In a direct topology, each node is connected to other nodes, which are named neighbouring nodes. Each node contains a network interface acting as a router in order to transfer information.&lt;br /&gt;
In an indirect topology, there are nodes that are not computational but act as switches to transfer the traffic among the rest of the nodes, including other switches. It is called indirect because packets are switched through specific elements that are not part of the computational nodes themselves.&lt;br /&gt;
&lt;br /&gt;
An example of direct topologies is 2-D Mesh. An example of indirect topology is Flattened Butterfly.  &lt;br /&gt;
&lt;br /&gt;
There are many different topologies that could be introduced in this section. Some of the missing topologies include but are not limited to:&lt;br /&gt;
&lt;br /&gt;
* Hypercube&lt;br /&gt;
* Shuffle-exchange&lt;br /&gt;
* Torus&lt;br /&gt;
* Trees&lt;br /&gt;
&lt;br /&gt;
They are just cited here for completeness; related information can be found at [http://www.cs.cf.ac.uk/Parallel/Year2/section5.html Interconnection Networks]&lt;br /&gt;
&lt;br /&gt;
=== 2-D Mesh ===&lt;br /&gt;
[[File:Mesh.png|thumb|c|right|upright=0.75|2D Mesh]]This has been a very popular topology due to its simple design and low layout and router complexity. It is often described as a k-ary n-cube, where k is the number of nodes in each dimension, and n is the number of dimensions. For example, a 4-ary 2-cube is a 4x4 2D mesh.&lt;br /&gt;
Another advantage is that this topology is similar to the physical die layout, making it more suitable to implement in tiled architectures. For reference, the combination of the switch and a processor is called a ''tile''.&lt;br /&gt;
&lt;br /&gt;
This topology also has drawbacks. One of them is that the degree of the nodes along the edges is lower than the degree of the central nodes. This makes the 2D mesh asymmetrical along the edges, so from the networking perspective there is less demand for edge channels than for central channels.&lt;br /&gt;
&lt;br /&gt;
Jerger and Peh [[#References|[2]]] provide the following information on parameters for a mesh defined as a k-ary n-cube:&lt;br /&gt;
*the switch degree for a 2D mesh would be 4, as its network requires two channels in each dimension or 2n, although some ports on the edge will be unused.&lt;br /&gt;
*average minimum hop count: &lt;br /&gt;
:{| {{table}}&lt;br /&gt;
| nk/3|| ||k even&lt;br /&gt;
|-&lt;br /&gt;
| n(k/3 − 1/(3k))|| ||k odd&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
*the channel load across the bisection of a mesh under uniform random traffic with an even k is k/4&lt;br /&gt;
*meshes provide diversity of paths for routing messages&lt;br /&gt;
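&lt;br /&gt;
The average minimum hop count can be cross-checked by brute force on small meshes. The sketch below makes two assumptions that are not stated in the text: the average is taken over all ordered source and destination pairs, self-pairs included, and the odd-k expression is read as n(k/3 − 1/(3k)). The function names are illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import product


def average_hop_count(k, n):
    """Mean Manhattan distance over all ordered pairs of nodes in a k-ary n-dimensional mesh."""
    nodes = list(product(range(k), repeat=n))
    total = sum(
        sum(abs(a - b) for a, b in zip(src, dst))
        for src in nodes
        for dst in nodes
    )
    return total / (len(nodes) ** 2)


def closed_form(k, n):
    """The odd-k expression n(k/3 - 1/(3k)) quoted in the table above."""
    return n * (k / 3 - 1 / (3 * k))


if __name__ == "__main__":
    print(average_hop_count(3, 2), closed_form(3, 2))
    print(average_hop_count(5, 2), closed_form(5, 2))
```
&lt;br /&gt;
Under this counting convention the brute-force value for k = 4, n = 2 is 2.5, whereas nk/3 gives about 2.67, so the even-k entry is best read as a close approximation rather than an exact count.&lt;br /&gt;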
&lt;br /&gt;
=== Concentration Mesh ===&lt;br /&gt;
[[File:Concentratedmesh.png|thumb|c|right|upright=0.75|Concentration Mesh]] This is an evolution of the mesh topology. There is no real need to have a 1:1 relationship between the number of cores and the number of switches/routers. The Concentration mesh reduces the ratio to 1:4, i.e. each router serves four computing nodes. &lt;br /&gt;
&lt;br /&gt;
The advantage over the simple mesh is the decrease in the average hop count. This is important in terms of scaling the solution. However, it is not as scalable as it might seem, because its degree is limited by crossbar complexity.[[#References|[1]]]&lt;br /&gt;
&lt;br /&gt;
The reduction in the ratio introduces a lower bisection channel count, but it can be avoided by introducing express channels, as demonstrated in [[#References|[4]]].&lt;br /&gt;
&lt;br /&gt;
Another drawback is that the port bandwidth can become a bottleneck in periods of high traffic.&lt;br /&gt;
&lt;br /&gt;
=== Flattened Butterfly ===&lt;br /&gt;
[[File:Flbfly.png|thumb|c|right|upright=0.75|Flattened butterfly]]A butterfly topology is often described as a k-ary n-fly, which implies k&amp;lt;sup&amp;gt;n&amp;lt;/sup&amp;gt; network nodes with n stages of k&amp;lt;sup&amp;gt;n−1&amp;lt;/sup&amp;gt; k × k intermediate routing nodes. The degree of each intermediate router is 2k.  &lt;br /&gt;
The flattened butterfly is made by flattening (i.e. combining) the routers in each row of a butterfly topology while preserving the inter-router connections. It uses non-minimal routing to improve load balancing in the network.&lt;br /&gt;
Some advantages are that the maximum distance between nodes is two hops and it has lower latency and better throughput than that of the mesh topology.&lt;br /&gt;
Its disadvantages are a high channel count (k&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;/2 per row/column), low channel utilization, and increased control complexity.&lt;br /&gt;
The flattened butterfly offers the benefits of a tree (fewer constraints on root-level bandwidth [Solihin 367]) as well as the ability to actually be mapped to a substrate, but because of node concentration&lt;ref name=&amp;quot;GrotKeckler&amp;quot;/&gt; the number of channels required for high scalability is cost- and validation-prohibitive.&lt;br /&gt;
&lt;br /&gt;
===Crossbar Switch===&lt;br /&gt;
A crossbar switch topology uses a bus arrangement with the bus lines physically perpendicular to each other and whose intersections are connected or disconnected with a switch. In the case of [http://en.wikipedia.org/wiki/Multi-core_(computing) CMPs], this switch is a transistor or, depending on the desired characteristics of the system, a programmable fuse. Due to their ability to be [http://en.wikipedia.org/wiki/Multistage_interconnection_networks multi-staged]&amp;lt;ref name=&amp;quot;wikicrossbarsemi&amp;quot;&amp;gt;&amp;quot;[http://en.wikipedia.org/wiki/Crossbar_switch#Semiconductor Crossbar switch].&amp;quot; Wikipedia. Last accessed April 24, 2012.&amp;lt;/ref&amp;gt;, these topologies lend themselves to being used for memory in large-scale systems. The IBM Cyclops64 architecture is an example of the implementation of this architecture&amp;lt;ref name=&amp;quot;cyclops64&amp;quot;&amp;gt;Zhang, Ying Ping. &amp;quot;[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1639301 A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture]. April 2006. IEEE Xplore.&amp;lt;/ref&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Multidrop Express Channels (MECS) ===&lt;br /&gt;
[[File:Mecs.png|thumb|c|right|upright=0.75|MECS]] Multidrop Express Channels was proposed in [[#References|[1]]] by Grot and Keckler. Their motivation was that performance and scalability should be obtained by managing wiring. &lt;br /&gt;
Multidrop Express Channels is defined by its authors as a &amp;quot;one-to-many communication fabric that enables a high degree of connectivity in a bandwidth-efficient manner.&amp;quot;  It is based on point-to-point unidirectional links. This makes for a high degree of connectivity with fewer bisection channels and higher bandwidth for each channel. &lt;br /&gt;
&lt;br /&gt;
Some of the parameters calculated for MECS are:&lt;br /&gt;
*Bisection channel count per row/column is equal to k.&lt;br /&gt;
*Network diameter (maximum hop count) is two.&lt;br /&gt;
*The number of nodes accessible through each channel ranges from 1 to k − 1.&lt;br /&gt;
*A node has 1 output port per direction&lt;br /&gt;
*The input port count is 2(k − 1)&lt;br /&gt;
&lt;br /&gt;
The low channel count and the high degree of connectivity provided by each channel increase per channel bandwidth and wire utilization. At the same time, the design minimizes the serialization delay. It presents low network latencies due to its low diameter.&lt;br /&gt;
&lt;br /&gt;
=== Comparison of topologies ===&lt;br /&gt;
This data is taken from the analysis done in [[#References|[1]]]. &lt;br /&gt;
&lt;br /&gt;
[[File:Topologycomp.png|thumbnail|center|upright=5|Comparison of CMesh, Flattened Butterfly, and MECS]]&lt;br /&gt;
&lt;br /&gt;
The table compares three of the topologies described above for two combinations of k, the network radix (nodes per dimension), and c, the concentration factor (1 meaning no concentration). &lt;br /&gt;
&lt;br /&gt;
The maximum hop count is 2 for flattened butterfly and MECS, whereas it is directly proportional to k for the concentrated mesh, which makes flattened butterfly and MECS the better solutions in terms of network latency.&lt;br /&gt;
&lt;br /&gt;
The number of bisection channels is 1 for CMesh in both cases, but it is doubled and even quadrupled between MECS and flattened butterfly. &lt;br /&gt;
&lt;br /&gt;
The bandwidth per channel in this example is higher for CMesh and MECS and is reduced in the case of flattened butterfly.&lt;br /&gt;
&lt;br /&gt;
=== Examples of topologies in current NoCs ===&lt;br /&gt;
&lt;br /&gt;
==== Intel ====&lt;br /&gt;
The [http://techresearch.intel.com/ProjectDetails.aspx?Id=151 Intel Teraflops Research Chip] is built around an 8x10 mesh with two 38-bit unidirectional links per channel. It has a bisection bandwidth of 380 GB/s, which includes data and sideband communication. Each computing node contains a 5-port router, and communication is carried out through message passing. Its name comes from its performance of one trillion mathematical calculations per second (1 teraflops), achieved with 80 simple cores, each containing 2 floating-point units, while consuming only 62 watts (less than many other processors).&lt;br /&gt;
 &lt;br /&gt;
The [http://techresearch.intel.com/ProjectDetails.aspx?Id=1 Single-Chip Cloud Computer] contains a 24-router mesh network with 256 GB/s bisection bandwidth. This design contains 48 fully functional cores and consumes only 25 watts. This newer model is more complete than the Teraflops Research model. It is fully programmable and is used for research by academia and private companies.&lt;br /&gt;
&lt;br /&gt;
==== Tilera ====&lt;br /&gt;
&lt;br /&gt;
The [http://www.tilera.com/products/processors Tilera TileGx, TilePro, and Tile64] use Tilera’s iMesh™ on-chip network. The iMesh™ consists of five 8x8 independent mesh networks with two 32-bit unidirectional links per channel. It provides a bisection bandwidth of 320 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each tile in the Tilera designs contains a complete processor with L1 and L2 caches. Each tile can run an operating system independently, or several tiles can together run a single operating system such as SMP Linux.&lt;br /&gt;
&lt;br /&gt;
==== ST Microelectronics ====&lt;br /&gt;
[[File:Spidergon.png|thumb|c|right|upright=1.5|Example of Spidergon design]]&lt;br /&gt;
ST Microelectronics created the Spidergon design for the STNoC [[#References|[5]]]. &lt;br /&gt;
&lt;br /&gt;
The Spidergon is a pseudo-regular topology with a design that is composed of three building blocks: network interface, router, and physical link. These building blocks make the design ready to be tailored to the needs of the application. Each router building block has a degree of 3.&lt;br /&gt;
&lt;br /&gt;
The 3 building blocks can be used to create the specific design needed, with the input/output ports that the application requires. The blocks can be configured and stored in a library for creating the design. In the picture on the right, the example contains 2 of the building blocks (router and network interface) and a third undisclosed block.&lt;br /&gt;
&lt;br /&gt;
==== IBM ====&lt;br /&gt;
The IBM [http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.html Cell] project uses an interconnect with four unidirectional 16B-wide data rings, two in each direction. The interconnect is called the Element Interconnect Bus (EIB); it allows the different components of the Cell to communicate with each other and with the external I/O. The total network bisection bandwidth is 307.2 GB/s. &lt;br /&gt;
&lt;br /&gt;
As a curiosity, the Cell processor was jointly developed with Sony and Toshiba, and is [http://en.wikipedia.org/wiki/Cell_(microprocessor) used] in the [http://news.cnet.com/PlayStation-3-chip-has-split-personality/2100-1043_3-5566340.html?tag=nl Sony PlayStation 3]. The Cell consists of a PowerPC core that manages eight Synergistic Processing Elements (SPEs), which can be used for floating-point calculations. These calculations provide the engine for better gaming systems.&lt;br /&gt;
&lt;br /&gt;
== Routing ==&lt;br /&gt;
&lt;br /&gt;
There are a variety of routing protocols that can be used for [http://en.wikipedia.org/wiki/System_on_a_chip SoCs], each having different advantages and disadvantages.  They can be broadly classified in several different ways.&lt;br /&gt;
&lt;br /&gt;
===General Routing Schemes===&lt;br /&gt;
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Store_and_forward Store and forward routing]==== &lt;br /&gt;
This routing scheme has been used since the early days of telecommunications.  It requires that the entire message be received at a node before it is propagated to the next node.  This protocol suffers from a high storage requirement and high latency, due to the need to completely buffer a message before forwarding it.[[#References|[7]]]  This approach can be quite effective when the average packet size is small in comparison with the channel widths.&lt;br /&gt;
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Cut-through_switching Cut-Through routing] or [http://en.wikipedia.org/wiki/Wormhole_switching Worm Hole routing]====&lt;br /&gt;
Both protocols use the switch to examine the header flit, decide where to send the message, and then start forwarding it immediately.  True cut-through routing lets the tail continue when the head is blocked, stacking message packets into a single switch (which requires a buffer large enough to hold the largest packet).  In worm hole routing, when the head of the message is blocked the message stays strung out over multiple nodes in the network, potentially blocking other messages (however, this needs only enough buffer space to store the piece of the packet that is sent between switches).  Using a cut-through protocol lowers latency, but because forwarding begins before the whole packet has arrived, corrupted packets can be passed on, so a scheme to handle this must be implemented.[[#References|[7]]]&lt;br /&gt;
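&lt;br /&gt;
A first-order, contention-free model makes the latency difference from store and forward routing concrete. The Python sketch below assumes a fixed per-hop router delay and a link that moves a fixed number of bits per cycle; the function and parameter names are illustrative, and the formulas are the usual zero-load approximations rather than anything taken from [[#References|[7]]].&lt;br /&gt;
&lt;br /&gt;
```python
def store_and_forward_latency(hops, packet_bits, link_bits_per_cycle, router_cycles=1):
    """Zero-load latency in cycles when every hop buffers the whole packet."""
    serialization = packet_bits / link_bits_per_cycle
    # The full serialization time is paid again at each hop.
    return hops * (router_cycles + serialization)


def cut_through_latency(hops, packet_bits, link_bits_per_cycle, router_cycles=1):
    """Zero-load latency in cycles when only the header waits at each hop."""
    serialization = packet_bits / link_bits_per_cycle
    # The header pays the router delay per hop; the body is pipelined behind it.
    return hops * router_cycles + serialization


# Assumed example values: 4 hops, a 512-bit packet, 64-bit links.
print(store_and_forward_latency(4, 512, 64))  # prints 36.0
print(cut_through_latency(4, 512, 64))        # prints 12.0
```
&lt;br /&gt;
Under these assumptions the gap grows with both the hop count and the packet length, which is consistent with store and forward being attractive mainly for small packets.&lt;br /&gt;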
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Deterministic_routing Deterministic routing]====&lt;br /&gt;
This describes a routing scheme where, if we are given a pair of nodes, the same path will always be used between those nodes.&lt;br /&gt;
&lt;br /&gt;
====[http://en.wikipedia.org/wiki/Adaptive_routing Adaptive routing]====&lt;br /&gt;
This is a routing scheme where the underlying routers may alter the path of packet flow in response to system conditions or other algorithm criteria.  Adaptive routing is intended to provide as many routes as possible to reach the destination.&lt;br /&gt;
&lt;br /&gt;
====Deadlock and Livelock====&lt;br /&gt;
&lt;br /&gt;
Deadlock and livelock are two separate situations that may occur during routing, both resulting in packets never reaching their destination.  They are defined as follows:&lt;br /&gt;
&lt;br /&gt;
''' Deadlock ''' is defined as a situation where there are activities (e.g., messages) each waiting for another to finish something.[[#References|[8]]] Since a waiting activity cannot finish, the messages are deadlocked.  This is analogous to the [http://en.wikipedia.org/wiki/Dining_philosophers_problem Dining Philosophers Problem]: each deadlocked message is waiting on the result of another deadlocked message, and none are able to reach their destination.&lt;br /&gt;
&lt;br /&gt;
''' Livelock ''' is defined as a situation where a message can move from node to node but will never reach its destination node.[[#References|[8]]]  This is similar to deadlock in that the message never reaches its destination, but the message is still able to travel through portions of the network, making hops but never reaching its target.  This is analogous to a process spinning while waiting: the process is doing meaningless work, but it is still active.  &lt;br /&gt;
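&lt;br /&gt;
The circular wait behind a deadlock can be sketched as a cycle search in a wait-for relation. The Python example below is a simplified model, assuming each blocked message waits on exactly one other message; the function name find_wait_cycle and the message labels are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def find_wait_cycle(waits_for):
    """Return a list of messages forming a circular wait, or None.

    waits_for maps each message to the single message it is blocked on,
    or to None if the message is free to advance.
    """
    for start in waits_for:
        seen = []
        current = start
        # Follow the chain of waits until it ends or revisits a message.
        while current is not None and current not in seen:
            seen.append(current)
            current = waits_for.get(current)
        if current is not None:
            # The chain closed on itself: everything from current onward is stuck.
            return seen[seen.index(current):]
    return None


print(find_wait_cycle({'A': 'B', 'B': 'C', 'C': 'A', 'D': None}))  # prints ['A', 'B', 'C']
print(find_wait_cycle({'A': 'B', 'B': None}))                      # prints None
```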
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
===Routing Protocols in SoC's===&lt;br /&gt;
&lt;br /&gt;
The specific routing protocols below are built using the ideas from the classes of protocols previously described.&lt;br /&gt;
&lt;br /&gt;
==== Source Routing ====&lt;br /&gt;
&lt;br /&gt;
The source node partially or totally computes the path a packet will take through the network and stores the information in the packet header.  The extra route information is sent in each packet, inflating its size.&lt;br /&gt;
&lt;br /&gt;
==== Distributed Routing ====&lt;br /&gt;
&lt;br /&gt;
Each switch in the network computes the next hop that the packet will take towards the destination.  The packet header contains only the destination information, reducing its size compared to source routing.  This approach requires routing tables to be present to direct the packet from node to node, which does not scale well when the number of nodes increases.&lt;br /&gt;
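&lt;br /&gt;
The difference in header contents between source routing and distributed routing can be sketched as follows. This Python example is a toy model: the packet layout, the helper names, and the three-node line A - B - C are all assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def make_source_routed_packet(path, payload):
    """Source routing: the sender writes every hop into the header."""
    return {'route': list(path), 'payload': payload}


def make_distributed_packet(destination, payload):
    """Distributed routing: the header holds only the destination."""
    return {'destination': destination, 'payload': payload}


def next_hop_source(packet):
    """A switch consumes the next hop from the header; no table is needed."""
    return packet['route'].pop(0)


def next_hop_distributed(node, packet, routing_tables):
    """A switch looks up the destination in its own routing table."""
    return routing_tables[node][packet['destination']]


# Hypothetical three-node line A - B - C, sending from A to C.
tables = {'A': {'C': 'B'}, 'B': {'C': 'C'}}
src_pkt = make_source_routed_packet(['B', 'C'], 'data')
dst_pkt = make_distributed_packet('C', 'data')
print(next_hop_source(src_pkt), next_hop_distributed('A', dst_pkt, tables))  # prints B B
```
&lt;br /&gt;
In this sketch the source-routed header grows with the path length, while the distributed scheme keeps the header fixed but needs one table per switch.&lt;br /&gt;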
&lt;br /&gt;
==== Logic Based Distributed Routing (LBDR) ====&lt;br /&gt;
&lt;br /&gt;
In this protocol, routing is achieved by each router knowing its position in the architecture and being able to determine in which direction the destination of the packet lies.  It is most commonly used in 2D meshes, but it can be applied to other topologies as well.[[#References|[7]]]  Using this position information, it is possible to route the packet based on a small number of bits and a few logic gates per router, which costs less than a table or a buffer.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are several variations of LBDR:&lt;br /&gt;
&lt;br /&gt;
''' LBDRe ''' - This variation models up to two future hops before deciding where to send the packet next.    &lt;br /&gt;
&lt;br /&gt;
''' uLBDR (Universal LBDR) ''' - This variation adds packet multicast support to the protocol.&lt;br /&gt;
&lt;br /&gt;
''' bLBDR ''' - This variation adds the ability to broadcast messages to only certain regions (segments) of the network.&lt;br /&gt;
&lt;br /&gt;
==== Bufferless Deflection Routing (BLESS protocol) ====&lt;br /&gt;
&lt;br /&gt;
In this protocol, each flit of a packet is routed independently of every other flit through the network, and different flits from the same packet may take different paths.  Any contention between multiple flits results in one flit taking the desired path and the other flit being “deflected” to some other router.  This may result in undesirable routing, but the packets will eventually reach the destination.[[#References|[10]]]  This type of routing is feasible on every network topology that satisfies the following two constraints: Every router has at least the same number of output ports as the number of its input ports, and every router is reachable from every other router.[[#References|[10]]]  &lt;br /&gt;
&lt;br /&gt;
==== CHIPPER (Cheap-Interconnect Partially Permuting Router) ====&lt;br /&gt;
&lt;br /&gt;
This protocol was designed to address inefficient port allocation in the BLESS protocol.  A permutation network directs deflected flits to free output ports.  By limiting the requirements so that only the highest-priority flit obtains its request, we can prevent livelock.  In the case of contention, arbitration logic chooses a winning flit.  It does this by choosing a single packet and prioritizing that packet globally above all other packets for long enough that its delivery is ensured.  Every packet in the system eventually receives this special status, so every packet is eventually delivered (the Golden Packet scheme).[[#References|[11]]]&lt;br /&gt;
&lt;br /&gt;
==== Dimension-order Routing ====&lt;br /&gt;
&lt;br /&gt;
This protocol is a deterministic strategy for multidimensional networks.  Each dimension is chosen in order and routed completely before switching to the next dimension.  For example, in a 2D mesh, dimension order routing could be implemented by completely routing the packet in the X-dimension before beginning to route in the Y-dimension.  This is extensible to higher order connections as well; for example, hypercubes can be routed in dimension order by routing packets along the dimensions in the order of the bit positions in which the source and destination addresses differ, one bit position at a time.[[#References|[9]]]&lt;br /&gt;
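&lt;br /&gt;
Both cases described above can be sketched in a few lines of Python. The function names xy_route and ecube_route, the coordinate convention, and the choice to resolve the lowest bit first are assumptions made for this illustration; [[#References|[9]]] may order the dimensions differently.&lt;br /&gt;
&lt;br /&gt;
```python
def xy_route(src, dst):
    """Return the (x, y) nodes visited in a 2D mesh, routing X fully before Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += (dst[0] - x) // abs(dst[0] - x)  # step of +1 or -1 along X
        path.append((x, y))
    while y != dst[1]:
        y += (dst[1] - y) // abs(dst[1] - y)  # step of +1 or -1 along Y
        path.append((x, y))
    return path


def ecube_route(src, dst, dimensions):
    """Return the node labels visited in a binary hypercube, lowest bit first."""
    node = src
    path = [node]
    for bit in range(dimensions):
        mask = 2 ** bit
        # Flip this address bit only if source and destination differ in it.
        if (node ^ dst) // mask % 2 == 1:
            node ^= mask
            path.append(node)
    return path


print(xy_route((0, 0), (2, 1)))  # prints [(0, 0), (1, 0), (2, 0), (2, 1)]
print(ecube_route(0, 5, 3))      # prints [0, 1, 5]
```
&lt;br /&gt;
Because the path depends only on the source and destination, the same pair of nodes always yields the same route, which is what makes the scheme deterministic.&lt;br /&gt;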
&lt;br /&gt;
== Lines of Research ==&lt;br /&gt;
From the NoC perspective, there are many lines of research beyond the abundance of technologies found in commercial designs. Some of them are presented in this section.&lt;br /&gt;
&lt;br /&gt;
=== Optical on-chip interconnects ===&lt;br /&gt;
IBM has been performing extensive research on a photonic layer inside the CMP that is used not only for connecting several cores, but also for routing traffic: [http://researcher.ibm.com/view_project.php?id=2757 Silicon Integrated Nanophotonics.] This technology was actually used in the IBM Cell chip that was mentioned in the sections above. The main advantages are reliability and power efficiency.&lt;br /&gt;
&lt;br /&gt;
This [http://www.research.ibm.com/photonics/publications/ecoc_tutorial_2008.pdf tutorial] explains some differences between electronics and photonics in terms of power consumption. The more power-efficient the computing, the more FLOPS per watt it delivers:&lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Electronics'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Photonics'''&lt;br /&gt;
|-&lt;br /&gt;
| Electronic network ~500W||Optic network &amp;lt;80W&lt;br /&gt;
|-&lt;br /&gt;
| power = bandwidth x length||power does not depend on bitrate nor length&lt;br /&gt;
|-&lt;br /&gt;
| buffer on chip that rx and re-tx every bit at every switch||rx (modulate) data once, without having to re-tx&lt;br /&gt;
|-&lt;br /&gt;
| ||switching fabric has almost no power dissipation&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
In academia, there are articles such as [[#References|[6]]], which proposes a new topology created for optical on-chip interconnects. The authors refer to previous papers that cite adaptations of well-known electronic designs, but highlight the need to provide a &amp;quot;scalable all-optical NoC, referred to as 2D-HERT, with passive routing of optical data streams based on their wavelengths.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Reconfigurable NoC ===&lt;br /&gt;
Another field of study is software-reconfigurable on-chip networks. They are commonly based on the 2D mesh topology. The main idea is to be able to reconfigure the NoC depending on the application and, at run time, to react to congestion problems or, in general, adapt to the traffic load. &lt;br /&gt;
&lt;br /&gt;
In [[#References|[12]]], the authors propose a design based on the properties of the  [http://en.wikipedia.org/wiki/Field-programmable_gate_array field-programmable gate array (FPGA)]. It can dynamically implement circuit-switching channels, perform variations in the topology, and reconfigure routing tables. One of the main drawbacks is the overhead that this reconfiguration introduces, although it is designed to minimize it.&lt;br /&gt;
&lt;br /&gt;
=== Bio NoC ===&lt;br /&gt;
Bio NoC or ANoC (Autonomic Network-on-Chip) is based on the concept of the human autonomic nervous system or the human biological immune system. The intention is to provide a NoC with self-organization, self-configuration, and self-healing to dynamically control networking functions. &lt;br /&gt;
&lt;br /&gt;
[[#References|[13]]] presents a collection of chapters/articles from emerging research issues in the ANoC field of application.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
 &lt;br /&gt;
[1] Mirza-Aghatabar, M.; Koohi, S.; Hessabi, S.; Pedram, M., [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4341445 &amp;quot;An Empirical Investigation of Mesh and Torus NoC Topologies Under Different Routing Algorithms and Traffic Models,&amp;quot;] 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007), pp. 19-26, 29-31 Aug. 2007&lt;br /&gt;
&lt;br /&gt;
[2] Ying Ping Zhang; Taikyeong Jeong; Fei Chen; Haiping Wu; Nitzsche, R.; Gao, G.R., [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1639301 &amp;quot;A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture,&amp;quot;] 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 10 pp., 25-29 April 2006&lt;br /&gt;
&lt;br /&gt;
[3] David Wentzlaff, Patrick Griffin, Henry Hoffmann, Liewei Bao, Bruce Edwards, Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John F. Brown III, and Anant Agarwal. 2007. [http://www.eecg.toronto.edu/~enright/tilera.pdf On-Chip Interconnection Architecture of the Tile Processor.] IEEE Micro 27, 5 (September 2007), 15-31.&lt;br /&gt;
&lt;br /&gt;
[4] D. N. Jayasimha, B. Zafar, Y. Hoskote. [http://blogs.intel.com/wp-content/mt-content/com/research/terascale/ODI_why-different.pdf On-chip interconnection networks: why they are different and how to compare them.] Technical Report, Intel Corp, 2006&lt;br /&gt;
&lt;br /&gt;
[5] John Kim, James Balfour, and William Dally. [http://cva.stanford.edu/publications/2007/MICRO_FBFLY.pdf Flattened butterfly topology for on-chip networks.] In Proceedings of the 40th International Symposium on Microarchitecture, pages 172–182, December 2007.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
[1] B. Grot and S. W. Keckler. [http://www.cs.utexas.edu/~bgrot/docs/CMP-MSI_08.pdf Scalable on-chip interconnect topologies.] 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects, 2008.&lt;br /&gt;
&lt;br /&gt;
[2] Natalie Enright Jerger and Li-Shiuan Peh. [http://www.morganclaypool.com/doi/abs/10.2200/S00209ED1V01Y200907CAC008?journalCode=cac On-Chip Networks.] Synthesis Lectures on Computer Architecture. 2009, 141 pages. Morgan and Claypool Publishers.&lt;br /&gt;
&lt;br /&gt;
[3] Yan Solihin. (2008). [http://www.cesr.ncsu.edu/solihin/Main.html Fundamentals of parallel computer architecture.] Solihin Pub.&lt;br /&gt;
&lt;br /&gt;
[4] James Balfour and William J. Dally. 2006. [http://www.cs.berkeley.edu.prox.lib.ncsu.edu/~kubitron/courses/cs258-S08/handouts/papers/jbalfour_ICS.pdf Design tradeoffs for tiled CMP on-chip networks.] In Proceedings of the 20th annual international conference on Supercomputing (ICS '06). ACM, New York, NY, USA, 187-198.&lt;br /&gt;
&lt;br /&gt;
[5] Dubois, F.; Cano, J.; Coppola, M.; Flich, J.; Petrot, F., [http://www.comcas.eu/publications/Spidergon_STNoC_Design.pdf Spidergon STNoC design flow,] 2011 Fifth IEEE/ACM International Symposium on Networks on Chip (NoCS), pp. 267-268, 1-4 May 2011&lt;br /&gt;
&lt;br /&gt;
[6] Koohi, S.; Abdollahi, M.; Hessabi, S., [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=5948588&amp;amp;isnumber=5948548 All-optical wavelength-routed NoC based on a novel hierarchical topology,] 2011 Fifth IEEE/ACM International Symposium on Networks on Chip (NoCS), pp. 97-104, 1-4 May 2011&lt;br /&gt;
&lt;br /&gt;
[7] Flich, J.; Duato, J., [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=4407676&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4407676 Logic-Based Distributed Routing for NoCs,] Computer Architecture Letters, vol. 7, no. 1, pp. 13-16, Jan 2008&lt;br /&gt;
&lt;br /&gt;
[8] Wu, J.; [http://www.cse.fau.edu/~jie/research/publications/Publication_files/ieeetc0309.pdf A Fault-Tolerant and Deadlock-Free Routing Protocol in 2D Meshes Based on Odd-Even Turn Model,] 2003 IEEE Transactions on Computers, Vol. 52, No. 9, pp.1154-1169, Sept 2003&lt;br /&gt;
&lt;br /&gt;
[9] Veselovsky, G.; Batovski, D.A.; [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=1183584&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1183584 A study of the permutation capability of a binary hypercube under deterministic dimension-order routing,] Proceedings of the Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing, pp. 173-177, 5-7 Feb. 2003&lt;br /&gt;
&lt;br /&gt;
[10] Moscibroda, T; Mutlu, O.; [http://research.microsoft.com/pubs/80241/isca_2009-bless.pdf A Case for Bufferless Routing in On-Chip Networks,] ACM SIGARCH Computer Architecture News, Volume 37 Issue 3, June 2009&lt;br /&gt;
&lt;br /&gt;
[11] Fallin, C.; Craik, C.; Mutlu, O.; [http://www.ece.cmu.edu/~safari/pubs/chipper_hpca2011.pdf CHIPPER: A Low-complexity Bufferless Deflection Router,] Proceedings of the 17th IEEE International Symposium on High Performance Computer Architecture (HPCA 2011), San Antonio, TX, February 2011.&lt;br /&gt;
&lt;br /&gt;
[12] V. Rana, et al., [http://infoscience.epfl.ch/record/130661/files/paperM2B-VLSI-SoC2008%5b1%5d.pdf A Reconfigurable Network-on-Chip Architecture for Optimal Multi-Processor SoC Communication,] in VLSI-SoC, 2009.&lt;br /&gt;
&lt;br /&gt;
[13] Cong-Vinh, P. (December 2011). [http://www.crcpress.com/product/isbn/9781439829110 Autonomic networking-on-chip: Bio-inspired specification, development, and verification.] CRC Press.&lt;br /&gt;
&lt;br /&gt;
[14] S. Kumar, et al., [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=1016885&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1016885 A Network on Chip Architecture and Design Methodology,] VLSI on Annual Symposium, IEEE Computer Society ISVLSI 2002.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
1. Advantage of 2-D Mesh&lt;br /&gt;
&lt;br /&gt;
a) simple design&lt;br /&gt;
&lt;br /&gt;
b) cumbersome design&lt;br /&gt;
&lt;br /&gt;
c) degree is the same for all nodes&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
2. Diameter is&lt;br /&gt;
&lt;br /&gt;
a) minimum hop count&lt;br /&gt;
&lt;br /&gt;
b) maximum hop count&lt;br /&gt;
&lt;br /&gt;
c) number of neighbors &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
3. SOC stands for&lt;br /&gt;
&lt;br /&gt;
a) System of Chips&lt;br /&gt;
&lt;br /&gt;
b) Switch of Cores&lt;br /&gt;
&lt;br /&gt;
c) System on a Chip&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4. In a direct topology,&lt;br /&gt;
&lt;br /&gt;
a) each node contains a network interface acting as a router in order to transfer information&lt;br /&gt;
&lt;br /&gt;
b) there are nodes that act as routers&lt;br /&gt;
&lt;br /&gt;
c) only one node is a computational node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5. The Single-Chip Cloud Computer contains &lt;br /&gt;
&lt;br /&gt;
a) an 8x10 mesh&lt;br /&gt;
&lt;br /&gt;
b) a 64-router mesh network&lt;br /&gt;
&lt;br /&gt;
c) a 24-router mesh network&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
6. A deterministic routing scheme uses algorithms to determine the most advantageous path to the target node.&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
7. Livelock is necessary to maintain coherence in routing protocols.&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
8. Dimension Order routing&lt;br /&gt;
&lt;br /&gt;
a) is only possible with 2D mesh-based topologies.&lt;br /&gt;
&lt;br /&gt;
b) attempts to route all packets in one dimension before starting another.&lt;br /&gt;
&lt;br /&gt;
c) uses routing tables to find the packet destination.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
9. Source routing&lt;br /&gt;
&lt;br /&gt;
a) includes information in the packet about the destination node&lt;br /&gt;
&lt;br /&gt;
b) uses routing information calculated by the sending node&lt;br /&gt;
&lt;br /&gt;
c) all of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
10. Store and forward routing&lt;br /&gt;
&lt;br /&gt;
a) requires the entire message to be broken into regular sized pieces and sent over the network&lt;br /&gt;
&lt;br /&gt;
b) is an optimal routing protocol&lt;br /&gt;
&lt;br /&gt;
c) buffers the entire message in each node along the route before sending it to the next node&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=74915</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=74915"/>
		<updated>2013-04-17T17:56:53Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring_2012/2a va ]]&lt;br /&gt;
*Chapter 2b [[CSC/ECE 506 Spring 2012/ch2b cm | CSC/ECE 506 Spring 2012/ch2b cm]]&lt;br /&gt;
*Chapter 2b [[ECE506_CSC/ECE_506_Spring_2012/2b_az | CSC/ECE 506 Spring 2012/2b az - Data-Parallel Processing with the AMD HD 6900 Series Graphics Processing Unit]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;br /&gt;
*Chapter 4a[[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]&lt;br /&gt;
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms  ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]&lt;br /&gt;
*Chapter 4b [[Chapter 4b CSC/ECE 506 Spring 2011 / ch4b]]&lt;br /&gt;
*Chapter 5a [[ CSC/ECE 506 Spring 2012/ch5a ja | CSC/ECE 506 Spring 2012/ch5a ja ]]&lt;br /&gt;
*Chapter 9a [[CSC/ECE 506 Spring 2012/ch9a cm | CSC/ECE 506 Spring 2012/ch9a cm]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]&lt;br /&gt;
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]&lt;br /&gt;
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]&lt;br /&gt;
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]&lt;br /&gt;
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]&lt;br /&gt;
*Chapter 10 [[CSC/ECE 506 Spring 2012/ch10 sj | CSC/ECE 506 Spring 2012/ch10 sj]]&lt;br /&gt;
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]&lt;br /&gt;
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]&lt;br /&gt;
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 (Ready for Final Review) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 | Interconnection Network Topologies]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a ry]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c dm]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c cl]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a mw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3a yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/7b yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3b sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/4b rs]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/6b am]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/8a cj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a dr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a jp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/9a ms]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10b sr]]&lt;br /&gt;
*Chapter 11a [[ECE506_CSC/ECE_506_Spring_2012/11a_az | CSC/ECE 506 Spring 2012/11a az - Performance of DSM system]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/12b jh]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a fu]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/11a ht]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1b dj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1a sp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1d ks]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/2b so]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1c ad]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/3b xz]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_aj]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_ss]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/1a_ag]]&lt;br /&gt;
* Chapter 3a [[CSC/ECE_506_Spring_2013/3a_bs]]&lt;br /&gt;
* Chapter 6a [[CSC/ECE_506_Spring_2013/6a_cs]]&lt;br /&gt;
* Chapter 5a [[CSC/ECE_506_Spring_2013/5a_ks]]&lt;br /&gt;
* Chapter 8a [[CSC/ECE_506_Spring_2013/8a_an]]&lt;br /&gt;
* Chapter 7a [[CSC/ECE_506_Spring_2013/7a_bs]]&lt;br /&gt;
* Chapter 8b [[CSC/ECE_506_Spring_2013/8b_ap]]&lt;br /&gt;
* Chapter 8c [[CSC/ECE_506_Spring_2013/8c_da]]&lt;br /&gt;
* Chapter 10a [[CSC/ECE_506_Spring_2013/10a_os]]&lt;br /&gt;
* Chapter 10c [[CSC/ECE_506_Spring_2013/10c_ks]]&lt;br /&gt;
* Chapter 12a [[CSC/ECE_506_Spring_2013/12a_cm]]&lt;br /&gt;
* Chapter 12b [[CSC/ECE_506_Spring_2013/12b_dj]]&lt;br /&gt;
* Chapter 12b [[CSC/ECE_506_Spring_2013/12b_sl]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/9b_sc]]&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=73818</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=73818"/>
		<updated>2013-03-01T21:52:50Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring_2012/2a va ]]&lt;br /&gt;
*Chapter 2b [[CSC/ECE 506 Spring 2012/ch2b cm | CSC/ECE 506 Spring 2012/ch2b cm]]&lt;br /&gt;
*Chapter 2b [[ECE506_CSC/ECE_506_Spring_2012/2b_az | CSC/ECE 506 Spring 2012/2b az - Data-Parallel Processing with the AMD HD 6900 Series Graphics Processing Unit]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;br /&gt;
*Chapter 4a[[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]&lt;br /&gt;
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms  ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]&lt;br /&gt;
*Chapter 4b [[Chapter 4b CSC/ECE 506 Spring 2011 / ch4b]]&lt;br /&gt;
*Chapter 5a [[ CSC/ECE 506 Spring 2012/ch5a ja | CSC/ECE 506 Spring 2012/ch5a ja ]]&lt;br /&gt;
*Chapter 9a [[CSC/ECE 506 Spring 2012/ch9a cm | CSC/ECE 506 Spring 2012/ch9a cm]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]&lt;br /&gt;
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]&lt;br /&gt;
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]&lt;br /&gt;
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]&lt;br /&gt;
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]&lt;br /&gt;
*Chapter 10 [[CSC/ECE 506 Spring 2012/ch10 sj | CSC/ECE 506 Spring 2012/ch10 sj]]&lt;br /&gt;
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]&lt;br /&gt;
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]&lt;br /&gt;
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 (Ready for Final Review) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 | Interconnection Network Topologies]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a ry]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c dm]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c cl]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a mw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3a yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/7b yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3b sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/4b rs]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/6b am]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/8a cj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a dr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a jp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/9a ms]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10b sr]]&lt;br /&gt;
*Chapter 11a [[ECE506_CSC/ECE_506_Spring_2012/11a_az | CSC/ECE 506 Spring 2012/11a az - Performance of DSM system]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/12b jh]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a fu]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/11a ht]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1b dj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1a sp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1d ks]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/2b so]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1c ad]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/3b xz]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_aj]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_ss]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/1a_ag]]&lt;br /&gt;
* Chapter 3a [[CSC/ECE_506_Spring_2013/3a_bs]]&lt;br /&gt;
* Chapter 6a [[CSC/ECE_506_Spring_2013/6a_cs]]&lt;br /&gt;
* Chapter 5a [[CSC/ECE_506_Spring_2013/5a_ks]]&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=73817</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=73817"/>
		<updated>2013-02-28T21:14:39Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring_2012/2a va ]]&lt;br /&gt;
*Chapter 2b [[CSC/ECE 506 Spring 2012/ch2b cm | CSC/ECE 506 Spring 2012/ch2b cm]]&lt;br /&gt;
*Chapter 2b [[ECE506_CSC/ECE_506_Spring_2012/2b_az | CSC/ECE 506 Spring 2012/2b az - Data-Parallel Processing with the AMD HD 6900 Series Graphics Processing Unit]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;br /&gt;
*Chapter 4a[[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]&lt;br /&gt;
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms  ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]&lt;br /&gt;
*Chapter 4b [[Chapter 4b CSC/ECE 506 Spring 2011 / ch4b]]&lt;br /&gt;
*Chapter 5a [[ CSC/ECE 506 Spring 2012/ch5a ja | CSC/ECE 506 Spring 2012/ch5a ja ]]&lt;br /&gt;
*Chapter 9a [[CSC/ECE 506 Spring 2012/ch9a cm | CSC/ECE 506 Spring 2012/ch9a cm]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]&lt;br /&gt;
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]&lt;br /&gt;
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]&lt;br /&gt;
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]&lt;br /&gt;
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]&lt;br /&gt;
*Chapter 10 [[CSC/ECE 506 Spring 2012/ch10 sj | CSC/ECE 506 Spring 2012/ch10 sj]]&lt;br /&gt;
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]&lt;br /&gt;
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]&lt;br /&gt;
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 (Ready for Final Review) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 | Interconnection Network Topologies]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a ry]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c dm]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c cl]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a mw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3a yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/7b yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3b sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/4b rs]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/6b am]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/8a cj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a dr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a jp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/9a ms]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10b sr]]&lt;br /&gt;
*Chapter 11a [[ECE506_CSC/ECE_506_Spring_2012/11a_az | CSC/ECE 506 Spring 2012/11a az - Performance of DSM system]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/12b jh]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a fu]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/11a ht]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1b dj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1a sp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1d ks]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/2b so]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1c ad]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/3b xz]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_aj]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/2a_ss]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/1a_ag]]&lt;br /&gt;
* Chapter 3a [[CSC/ECE_506_Spring_2013/3a_bs]]&lt;br /&gt;
* Chapter 6a [[CSC/ECE_506_Spring_2013/6a_cs]]&lt;br /&gt;
* Chapter 5a [[CSC/ECE_506_Spring_2013/5a_ks]]&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72704</id>
		<title>CSC/ECE 506 Spring 2013/4a ss</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72704"/>
		<updated>2013-02-13T23:02:50Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Dsm.jpg|300px|thumb|left|SCD's IBM SP system blackforest, a distributed shared memory ('''DSM''') system]]&lt;br /&gt;
&lt;br /&gt;
== SAS programming on distributed-memory machines ==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Shared_memory '''Shared Address Space'''] (SAS) programming on distributed memory machines is a programming abstraction that requires less development effort than the traditional [http://en.wikipedia.org/wiki/Message_passing '''Message Passing'''] (MP) approach on the same machines. A cluster of servers is one example of a platform on which SAS programming is used.  Distributed systems are groups of computers that communicate through a network and share a common work goal.  They typically do not physically share the same memory (are not [http://en.wikipedia.org/wiki/Coupling_%28computer_programming%29 '''tightly coupled''']); instead, each processor or group of processors must depend on mechanisms other than direct memory access in order to communicate.  The issues this raises include [http://en.wikipedia.org/wiki/Memory_coherence '''memory coherence'''], types of memory access, data and process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''], and performance.&lt;br /&gt;
&lt;br /&gt;
=== Origins ===&lt;br /&gt;
Distributed memory systems started to flourish in the 1980s. Increasing processor performance and network connectivity offered the perfect environment for parallel processing over a network of computers, which was a cheap way to put together massive computing power. The main drawback was the move from sequential programs written for local memory to parallel programs operating on shared memory. SAS simplified this programming by hiding the mechanisms used to access distant memory located in other computers of the cluster.&lt;br /&gt;
&lt;br /&gt;
In 1985, Cheriton in his article [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 &amp;quot;Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems&amp;quot;] introduced ideas for the application of shared memory techniques in Distributed memory systems. Cheriton envisioned a system of nodes with a pool of shared memory with a common file namespace that could &amp;quot;decentralize the implementation of a service.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:Dsmd.jpg|400px|thumb|right|Distributed Shared Memory]]&lt;br /&gt;
Early distributed computer systems relied almost exclusively on message passing in order to communicate with one another, and this technique is still widely used today.  In a message passing model, each processor's local memory can be considered as isolated from that of the rest of the system.  Processes or objects can send or receive messages in order to communicate, and this can occur in a synchronous or asynchronous manner.  In distributed systems, particularly with certain types of programs, the message passing model can become overly burdensome to the programmer, as tracking data movement and maintaining data integrity can become quite challenging with many control threads.  A shared address or shared-memory system, however, can provide a programming model that simplifies data sharing via uniform mechanisms of data structure reads and writes on common memory.  Current distributed systems seek to take advantage of both SAS and MP programming model principles in hybrid systems. &lt;br /&gt;
&lt;br /&gt;
=== Distributed Shared Memory (DSM) ===&lt;br /&gt;
Distributed shared memory is an architectural approach designed to overcome the scaling limitations of symmetric shared memory multiprocessors while retaining a shared memory model for communication and programming. A distributed system consists of a set of nodes connected by an interconnection network.  Nodes may consist of individual processors or of a multiprocessor system (e.g., a [http://en.wikipedia.org/wiki/Symmetric_multiprocessing '''Symmetric Multiprocessor'''] (SMP)), the latter typically sharing a system bus.  Each node contains a local memory, part of which is mapped into the distributed address space.  &lt;br /&gt;
Regardless of the system topology (bus, ring, mesh), a specific interconnection in each cluster must connect it to the system. Information about the states and current locations of particular data blocks usually resides in a system table or directory. Directory organization varies from full-map storage to various dynamic organizations.&lt;br /&gt;
Three issues arise when accessing data in the DSM address space while keeping the data consistent:&lt;br /&gt;
a) Which DSM algorithm to use to access data - Commonly used strategies are replication and migration. Replication allows multiple copies of the same data item to reside in different local memories. Migration implies that only a single copy of a data item exists at any one time, so the data item must be moved to the requesting site for exclusive use.&lt;br /&gt;
b) Implementation level of the DSM - A lookup must first determine whether the requested data is in local memory; if it is not, the system must bring the data into local memory. This can be done in software, in hardware, or in both. The choice of implementation depends on price/performance trade-offs.&lt;br /&gt;
c) Memory consistency model - The behavior of memory with respect to read and write operations from multiple processors must be governed by an appropriate memory consistency model.&lt;br /&gt;
&lt;br /&gt;
=== Cache-Coherent DSM ===&lt;br /&gt;
&lt;br /&gt;
The cache coherence problem arises when different processors cache and update values of the same memory location. Early DSM systems implemented a shared address space where the amount of time required to access a piece of data depended on its location.  These systems became known as [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access '''Non-Uniform Memory Access'''] (NUMA) architectures, whereas an SMP-type system is known as a [http://en.wikipedia.org/wiki/Uniform_Memory_Access '''Uniform Memory Access'''] (UMA) architecture.  NUMA architectures were difficult to program because of potentially significant differences in data access times. SMP architectures dealt with this problem through caching. Protocols were established to ensure that, before a location is written, all other copies of the location (in other caches) are invalidated. Thus, the system allows multiple copies of a memory location to exist while it is being read, but only one copy while it is being written.&lt;br /&gt;
&lt;br /&gt;
Cache-coherent DSM architectures rely on directory-based [http://en.wikipedia.org/wiki/Cache_coherency '''cache coherence'''], in which an extra directory structure keeps track of all blocks that have been cached by each processor. Since the directory tracks which caches hold copies of any given memory block, a coherence protocol can use it to maintain a consistent view of memory by keeping state and other information about each cached block. A simple cache coherence protocol can operate with three states for each cache block: Invalid, Shared, and Exclusive.  Furthermore, in a cache-coherent DSM machine, the directory is distributed across the nodes, so that each directory entry is kept with the memory block it describes in that node's physical local memory.&lt;br /&gt;
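&lt;br /&gt;
As a minimal sketch of the three-state scheme described above, the following Python model keeps a directory that records which nodes cache each block. The names used here (Directory, State, read, write) are assumptions made for illustration and are not taken from any particular DSM system; real protocols also handle write-backs, acknowledgements, and races that are omitted.&lt;br /&gt;

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    EXCLUSIVE = "E"


class Directory:
    """Tracks, for each memory block, which nodes hold a cached copy."""

    def __init__(self):
        self.sharers = {}  # maps block address to the set of node ids with a copy
        self.state = {}    # maps block address to the State of the cached copies

    def read(self, node, block):
        # Record the reader; a read by a second node downgrades Exclusive to Shared.
        holders = self.sharers.setdefault(block, set())
        holders.add(node)
        only_reader = holders == {node}
        was_exclusive = self.state.get(block) is State.EXCLUSIVE
        if was_exclusive and only_reader:
            return  # the owner re-reads its own exclusive copy
        self.state[block] = State.SHARED

    def write(self, node, block):
        # Every other copy is invalidated before the write proceeds.
        invalidated = self.sharers.get(block, set()) - {node}
        self.sharers[block] = {node}
        self.state[block] = State.EXCLUSIVE
        return invalidated
```

A write returns the set of nodes whose copies must be invalidated, which mirrors the rule that many copies may exist while a block is read but only one while it is written.&lt;br /&gt;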
&lt;br /&gt;
=== Page Management and memory mapping in Mome ===&lt;br /&gt;
[[File:Untitled_Project.jpg|350px|thumb|left|Memory Mapping in Mome]]&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Mome] is described by its developers as a user-level distributed shared memory.  Mome, in 2003, was a run-time model that mapped Mome segments onto node private address space.  &lt;br /&gt;
&lt;br /&gt;
===== Mome Segment creation =====&lt;br /&gt;
&lt;br /&gt;
Segment creation was initiated through a ''MomeCreateSegment(size)'' call, which returned an identifier for mapping used by all nodes.  Any process can request a mapping of a section of its local memory to a Mome segment section by calling ''MomeMap(Addr, Lg, Prot, Flags, Seg, Offset)'', which returns the starting address of the mapped region.  Each mapping request made by a process is independent, and the addresses of the mappings may or may not be consistent on all nodes.  If mappings are consistent between processes, however, then pointers may be shared by them.  Mome supports strong and weak consistency models, and for any particular page each node is able to dynamically manage its consistency during program execution.&lt;br /&gt;
&lt;br /&gt;
===== Page Management in Mome =====&lt;br /&gt;
&lt;br /&gt;
Mome manages [http://en.wikipedia.org/wiki/Page_%28computer_memory%29 '''pages'''] in a directory based scheme where each page directory maintains the status of six characteristics per page on each node.  The page manager acts upon collections of nodes according to these characteristics for each page:  &lt;br /&gt;
V nodes possess the current version, M nodes have a modified version, S nodes want strong consistency, I nodes are &lt;br /&gt;
invalidated, F nodes have initiated a modification merge and H nodes are a special type of hidden page.  A new version of &lt;br /&gt;
a page is created prior to a constraint violation and before modifications are integrated as a result of a consistency&lt;br /&gt;
request.  &lt;br /&gt;
&lt;br /&gt;
===== Memory mapping in Mome =====&lt;br /&gt;
&lt;br /&gt;
The Mome memory mapping figure to the left shows a possible DSM memory organization on a single node.  The DSM memory size&lt;br /&gt;
shown is 22 pages.  When a new segment is created on a node a segment descriptor is created on that node.  In this case the&lt;br /&gt;
segment descriptor is 12 pages, with each segment descriptor block corresponding to one page.  Each block also contains&lt;br /&gt;
three DSM memory references for current, modified and next version of pages.  The memory organization state shows an &lt;br /&gt;
application with two mappings, M1 and M2, with segment offsets at 0 and 8.  The six pages of M1 are managed by segment &lt;br /&gt;
descriptor blocks 0 to 5.  The descriptor blocks (and application memory) show that pages 1,2 and 5 have no associated &lt;br /&gt;
memory, while M1 page 0 is mapped to block 6 as a current version and M1 page 3 is mapped to block 13 as a current version,&lt;br /&gt;
block 8 as a modified version, and has initiated a modifications merge as indicated by the block 17 pointer.  The communication&lt;br /&gt;
layer manages incoming messages from other nodes. &lt;br /&gt;
&lt;br /&gt;
[[File:Mem hierarchy.png|200px|thumb|right|Memory hierarchy of node]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Node Communication ===&lt;br /&gt;
As described by [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf Yoon, et al.] in 1994 the communication paradigm &lt;br /&gt;
for their DSM node network relies on a memory hierarchy for each node that places remote memories at the same hierarchy as &lt;br /&gt;
its own local disk storage.  [http://en.wikipedia.org/wiki/Page_fault '''Page faults'''] within a given node that can be resolved within disk storage are handled &lt;br /&gt;
normally while those that cannot are resolved between node main memory and memory of other nodes. Point to point &lt;br /&gt;
communication at the node level is supported through message passing, and the specific mechanism for communication is&lt;br /&gt;
agreed to by all nodes. &lt;br /&gt;
&lt;br /&gt;
Yoon describes a DSM system that generates a shared virtual memory on a per-job basis.  A '''configurable shared virtual address space'''&lt;br /&gt;
(CSVS) is prepared when a member node receives a job, generates a job identification number, and creates&lt;br /&gt;
an information table in its memory:&lt;br /&gt;
&lt;br /&gt;
                                     ''JOB_INFORMATION {''&lt;br /&gt;
                                         ''status;''&lt;br /&gt;
                                         ''number_of_tasks;''&lt;br /&gt;
                                         ''number_of_completed_tasks;''&lt;br /&gt;
                                         ''*member_list;''                    /*pointer to first member*/&lt;br /&gt;
                                         ''number_of_members;''&lt;br /&gt;
                                         ''IO_server;''&lt;br /&gt;
                                     ''}''&lt;br /&gt;
&lt;br /&gt;
The ''status'' refers to the creation of the CSVS and ''number_of_members'' and ''member_list'' are established through&lt;br /&gt;
a task distribution process during address space assignment.  All tasks associated with the program are tagged with&lt;br /&gt;
the ''job_id'' and ''requester_id'' and, following address space assignment, are distributed across the system.  The&lt;br /&gt;
actual CSVS creation occurs when the first task of a job is initiated by a member, which requests that all other members generate the new&lt;br /&gt;
CSVS.  Subspace assignment for the SAS model then proceeds under the specific ''job_id''.&lt;br /&gt;
&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Operating_system '''operating system'''] (OS) or [http://en.wikipedia.org/wiki/Memory_management_unit '''memory management unit'''] (MMU) of each member maintains a copy of the ''JOB_INFORMATION'' &lt;br /&gt;
table which is consulted to identify the default manager when a page fault occurs.  When a page fault does occur, the MMU&lt;br /&gt;
locates the default manager and handles the fault normally.  If the page requested is out of its subspace then the &lt;br /&gt;
virtual address, ''job_id'', and default manager identification are sent to the [http://en.wikipedia.org/wiki/Control_unit '''control unit'''] (CU) to construct a &lt;br /&gt;
message requesting a page copy.  All messages sent through the CSVS must include a virtual address and the ''job_id'',&lt;br /&gt;
which acts as protection to control access to relevant memory locations.  When received at the appropriate member&lt;br /&gt;
node, the virtual address is translated to a local physical address.&lt;br /&gt;
[[File:Jacobi_code.jpg|300px|thumb|left|Jacobi method pseudocode using TreadMarks API]]&lt;br /&gt;
&lt;br /&gt;
=====Improvements in communication=====&lt;br /&gt;
Early SAS programming models in DSM environments suffered from poor performance because protection schemes required applications to access the network via system calls, significantly increasing latency.  Later software systems and network interfaces were able to ensure safety without incurring the time cost of system calls.  Addressing this and other latency sources on both ends of communication was an important goal for projects such as the '''Virtual memory-mapped communication''' (VMMC) model that was developed as part of the [http://shrimp.cs.princeton.edu/index.html Shrimp Project]. &lt;br /&gt;
&lt;br /&gt;
Protection is achieved in VMMC because the receiver must grant permission before the sender is allowed to transfer data&lt;br /&gt;
to a receiver defined area of its address space.  In this communication scheme, the receiver process exports areas of its&lt;br /&gt;
address space that will act as receive buffers and sending processes must import the destinations.  There is no explicit &lt;br /&gt;
receive operation in VMMC.  Receivers are able&lt;br /&gt;
to define which senders can import specific buffers and VMMC ensures only receiver buffer space is overwritten.  Imported&lt;br /&gt;
receive buffers are mapped to a destination proxy space which can be implemented as part of the sender's virtual address&lt;br /&gt;
space and can be translated by VMMC to a receiver, process and memory address.  VMMC supports a deliberate update&lt;br /&gt;
request and will update data sent previously to an imported receive buffer.  This transfer occurs directly without receiver&lt;br /&gt;
CPU interruption.&lt;br /&gt;
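&lt;br /&gt;
The export/import permission scheme described above can be modelled in a few lines of Python. This is only an in-process sketch under stated assumptions: the names Receiver, Proxy, export, import_buffer and send are hypothetical and are not the VMMC API, and no real network transfer or address translation takes place.&lt;br /&gt;

```python
class Receiver:
    """Owns receive buffers and decides which senders may import them."""

    def __init__(self):
        self.buffers = {}  # maps buffer name to a bytearray used as receive buffer
        self.allowed = {}  # maps buffer name to the sender ids granted permission

    def export(self, name, size, senders):
        # Export a region of the receiver's address space as a receive buffer.
        self.buffers[name] = bytearray(size)
        self.allowed[name] = set(senders)

    def import_buffer(self, name, sender):
        # Importing fails unless the receiver granted this sender permission.
        if sender not in self.allowed.get(name, set()):
            raise PermissionError(f"{sender} may not import {name}")
        return Proxy(self.buffers[name])


class Proxy:
    """Sender-side handle standing in for the destination proxy space."""

    def __init__(self, buffer):
        self._buffer = buffer

    def send(self, offset, data):
        # Refuse any transfer that would write outside the exported buffer.
        if offset not in range(len(self._buffer) - len(data) + 1):
            raise ValueError("transfer exceeds the receive buffer")
        self._buffer[offset:offset + len(data)] = data
```

Note that the sketch has no receive call: data simply appears in the exported buffer, as in the scheme described above.&lt;br /&gt;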
&lt;br /&gt;
[[File:Shortest_path_pseudocode.jpg|300px|thumb|right|Shortest path pseudocode using TreadMarks API]]&lt;br /&gt;
=== Programming Environment ===&lt;br /&gt;
The globally shared memory abstraction provided through virtual memory or some other DSM mechanism allows programmers &lt;br /&gt;
to focus on algorithms instead of processor communication and data tracking.  Many programming environments have been&lt;br /&gt;
developed for DSM systems including Rice University's [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]&lt;br /&gt;
in the 1990s.  TreadMarks was a user-level library that ran on top of Unix.  Programs were written in&lt;br /&gt;
C, C++ or Fortran and then compiled and linked with the TreadMarks library.  &lt;br /&gt;
&lt;br /&gt;
Shown at left is a pseudocode example of using the TreadMarks API to implement the Jacobi method, an iterative technique commonly used to solve partial differential equations.  The code iterates over a 2D array and updates each element to the average of its four nearest neighbors.  All processors are assigned an approximately equal number of rows, and neighboring processes share boundary rows as is necessary for the calculation.  This example shows TreadMarks' use of [http://en.wikipedia.org/wiki/Barrier_%28computer_science%29 '''barriers'''], a technique used for process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''].  Barriers are used here to prevent race conditions.  ''void Tmk_startup(int argc, char **argv)'' initializes TreadMarks and starts the remote processes.  The ''void Tmk_barrier(unsigned id)'' call blocks the calling process until every other process arrives at the barrier.  In this example, ''Tmk_barrier(0)'' guarantees that process 0 completes initialization before any process proceeds, ''Tmk_barrier(1)'' guarantees all previous iteration values are read before any current iteration values are written, and ''Tmk_barrier(2)'' guarantees all current iteration values are written before any next iteration computation begins.&lt;br /&gt;
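&lt;br /&gt;
The same two-barrier pattern can be sketched with Python threads, using threading.Barrier from the standard library in place of the TreadMarks calls. The function name jacobi, the scratch array, and the round-robin assignment of rows to workers are assumptions made for this illustration; threads in one process stand in for DSM nodes, so nothing here reflects TreadMarks internals.&lt;br /&gt;

```python
import threading


def jacobi(grid, iterations, workers=2):
    """Replace each interior cell by the average of its four neighbours."""
    rows, cols = len(grid), len(grid[0])
    scratch = [row[:] for row in grid]
    barrier = threading.Barrier(workers)
    interior = list(range(1, rows - 1))

    def work(rank):
        my_rows = interior[rank::workers]  # round-robin row assignment
        for _ in range(iterations):
            for i in my_rows:
                for j in range(1, cols - 1):
                    scratch[i][j] = (grid[i - 1][j] + grid[i + 1][j]
                                     + grid[i][j - 1] + grid[i][j + 1]) / 4.0
            barrier.wait()  # every old value has been read
            for i in my_rows:
                grid[i][1:cols - 1] = scratch[i][1:cols - 1]
            barrier.wait()  # every new value is written before the next pass

    threads = [threading.Thread(target=work, args=(r,)) for r in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return grid
```

The two waits play the roles that the text assigns to ''Tmk_barrier(1)'' and ''Tmk_barrier(2)'': without the first, a worker could overwrite a boundary row that a neighbor is still reading.&lt;br /&gt;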
&lt;br /&gt;
To the right is a short pseudocode program illustrating another SAS synchronization technique, [http://en.wikipedia.org/wiki/Lock_%28computer_science%29 '''locks'''].  This program calculates the shortest route through a group of nodes that starts at a designated start node, visits every other node once, and returns to the origin node.  The shortest route identified thus far is stored in the shared ''Shortest_length'', and investigated routes are kept in a queue, most promising at the front, and expanded one node at a time.  A process compares its resulting shortest partial path with ''Shortest_length'', updates it if necessary, and returns to the queue to continue its search.  Process 0 allocates the shared queue and minimum length.  Exclusive access must be established and maintained to ensure correctness, and this is achieved through locks on the queue and on ''Shortest_length''.  Each process acquires the queue lock to identify a promising partial path and releases it upon finding one.  When updating ''Shortest_length'', a lock is likewise acquired to ensure [http://en.wikipedia.org/wiki/Mutual_exclusion '''mutual exclusion'''] on this shared data.&lt;br /&gt;
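&lt;br /&gt;
A simplified Python sketch of the two locks follows. It is not the TreadMarks program: for brevity it enumerates complete routes exhaustively instead of expanding the most promising partial paths first, and the names shortest_tour, queue_lock and length_lock are assumptions made for illustration.&lt;br /&gt;

```python
import itertools
import threading
from collections import deque


def shortest_tour(dist, workers=3):
    """Length of the shortest round trip from node 0 through every node."""
    n = len(dist)
    # Shared state: a queue of candidate routes and the best length so far.
    queue = deque(itertools.permutations(range(1, n)))
    shared = {"shortest_length": float("inf")}
    queue_lock = threading.Lock()
    length_lock = threading.Lock()

    def work():
        while True:
            with queue_lock:  # exclusive access while taking a route
                if not queue:
                    return
                route = (0,) + queue.popleft() + (0,)
            length = sum(dist[a][b] for a, b in zip(route, route[1:]))
            with length_lock:  # mutual exclusion on the shared minimum
                shared["shortest_length"] = min(shared["shortest_length"], length)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["shortest_length"]
```

Each lock is held only for the brief queue or minimum update, so the route length itself is computed outside any critical section.&lt;br /&gt;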
&lt;br /&gt;
=== DSM Implementations ===&lt;br /&gt;
From an architectural point of view, DSMs are composed of several nodes connected via a network. Each node can be an individual machine or a cluster of machines. Each node has local memory modules, some or all of which form part of the shared memory. Many characteristics can be used to classify DSM implementations. An obvious one is the level at which the DSM is implemented: software, hardware, or hybrid. This historical classification has been extracted from [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 Distributed shared memory: concepts and systems].&lt;br /&gt;
&lt;br /&gt;
Software DSM implementations are those built from user-level software, the OS, a programming language, or some combination of these. &lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Implementation'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Implementation / Cluster configuration'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Network'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Algorithm'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Consistency Model'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Granularity Unit'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Coherence Policy'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''SW/HW/Hybrid'''&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/vm-and-gc/ivy-shared-virtual-memory-li-icpp-1988.pdf IVY]||User-level library + OS modification || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||1 Kbyte ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/121133.121159 Munin]||Runtime system + linker + library + preprocessor + OS modifications ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||Type-specific (SRSW, MRSW, MRMW) ||Release ||Variable size objects ||Type-specific (delayed update, invalidate) ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]||User-level || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRMW ||Lazy release ||4 Kbytes ||Update, Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/74851.74871 Mirage]||OS kernel ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||MRSW ||Sequential ||512 bytes ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://onlinelibrary.wiley.com.prox.lib.ncsu.edu/doi/10.1002/spe.4380210503/pdf Clouds]||OS, out of kernel || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Inconsistent, sequential ||8 Kbytes ||Discard segment when unlocked ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1663305&amp;amp;tag=1 Linda]||Language || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||Variable (tuple size) ||Implementation- dependent ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=93183 Memnet]||Single processor, Memnet device||Token ring||MRSW||Sequential||32 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Scalable_Coherent_Interface SCI]||Arbitrary||Arbitrary||MRSW||Sequential||16 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.8026&amp;amp;rep=rep1&amp;amp;type=pdf KSR1]||64-bit custom PE, I+D caches, 32M local memory||Ring-based||MRSW||Sequential||128 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=766965&amp;amp;isnumber=16621 RMS]||1-4 processors, caches, 256M local memory||RM bus||MRMW||Processor||4 bytes||Update||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Alewife_(multiprocessor) Alewife]||Sparcle PE, 64K cache, 4M local memory, CMMU|| mesh||MRSW||Sequential||16 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://dl.acm.org/citation.cfm?id=192056 Flash]||MIPS T5, I+D caches, Magic controller|| mesh||MRSW||Release||128 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s08/papers/10_tempest.pdf Typhoon]||SuperSparc, 2-L caches|| NP controller||MRSW||Custom||32 bytes||Invalidate custom||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://shrimp.cs.princeton.edu/ Shrimp]||16 Pentium PC nodes|| Intel Paragon routing network||MRMW||AURC, scope||4 Kbytes||Update/Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Below is an explanation of the main characteristics listed in the DSM classification.&lt;br /&gt;
 &lt;br /&gt;
There are three types of DSM algorithm: &lt;br /&gt;
* '''Single Reader/ Single Writer''' (SRSW) &lt;br /&gt;
** central server algorithm - produces long network delays &lt;br /&gt;
** migration algorithm - produces thrashing and false sharing&lt;br /&gt;
* '''Multiple Readers/ Single Writer''' (MRSW) - read replication algorithm. It uses write invalidate. MRSW is the most widely adopted algorithm.&lt;br /&gt;
* '''Multiple Readers/Multiple Writers''' (MRMW) - full replication algorithm.  It has full concurrency and uses atomic updates.&lt;br /&gt;
&lt;br /&gt;
The consistency model plays a fundamental role in DSM systems. Because memory is distributed, each consistency model constrains the order in which memory accesses may be observed. &amp;quot;A memory consistency model defines the legal ordering of memory references issued by some processor, as observed by other processors.&amp;quot; The stricter the consistency model, the longer the access times, but the simpler the programming. Some of the consistency models are:&lt;br /&gt;
* Sequential consistency - all processors see the same ordering of memory references, and the references of each individual processor appear in the order in which that processor issued them.&lt;br /&gt;
* Processor consistency - the writes issued by any one processor are observed by every processor in the order they were issued, but writes from different processors may be observed in different orders, and reads need not be ordered with respect to earlier writes. &lt;br /&gt;
* Weak consistency - consistency is required only on synchronization accesses.&lt;br /&gt;
* Release consistency - divides synchronization accesses into acquire and release. Normal reads and writes by a node can be performed only after all previous acquires by that node have completed. Similarly, a release can be performed only after all previous reads and writes have completed. &lt;br /&gt;
* Lazy release consistency - extends release consistency by propagating modifications to shared data only on an acquire, and then only the modifications related to the critical section being entered.&lt;br /&gt;
* Entry consistency - synchronization is performed at variable level. This increases programming labor but helps with lowering latency and traffic exchanges as only the specific variable needs to be synchronized.&lt;br /&gt;
&lt;br /&gt;
Granularity refers to the size of the data blocks that are managed by the coherence protocol. The unit differs between hardware and software systems: hardware systems tend to use smaller blocks than the virtual memory layer that manages the data in software systems. The problem with larger blocks is that the probability of contention is higher, even when the processors involved are not accessing the exact same piece of memory, only different parts of the same block. This is known as false sharing, and it causes thrashing (memory blocks keep being requested by processors, and processors keep waiting for the same memory blocks).&lt;br /&gt;
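&lt;br /&gt;
The effect of block size on false sharing can be illustrated with a small sketch. The two addresses below are hypothetical, and the block sizes are only sample values in the range listed in the classification table; none of this is taken from a real DSM implementation.&lt;br /&gt;
&lt;br /&gt;
```python
def block_id(address, block_size):
    """Return the index of the coherence block that contains a byte address."""
    return address // block_size


def same_block(addr_a, addr_b, block_size):
    """True when both addresses fall inside one coherence block."""
    return block_id(addr_a, block_size) == block_id(addr_b, block_size)


# Two unrelated variables written by different processors (hypothetical addresses).
counter_a = 0x1000
counter_b = 0x1040  # 64 bytes after counter_a

for size in (16, 128, 4096):
    shared = same_block(counter_a, counter_b, size)
    print(f"block size {size:5d} B: falsely shared = {shared}")
```
&lt;br /&gt;
With a 16-byte block the two variables live in different blocks, while with 128-byte or 4-Kbyte blocks every write to one of them also affects the block that holds the other.&lt;br /&gt;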
&lt;br /&gt;
The coherence policy regulates data replication. It dictates whether data that is written at one site should be invalidated or updated at the remote sites. Usually, systems with fine-grain coherence (byte/word) use the update policy, whereas systems based on coarse-grain (page) coherence use the invalidate policy. Elsewhere in the literature this is known as the coherence protocol, and the two types of protocol are known as write-invalidate and write-update. The write-invalidate protocol invalidates all copies except one before writing to it. In contrast, write-update keeps all copies up to date.&lt;br /&gt;
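&lt;br /&gt;
The difference between the two policies can be sketched with a toy model of one replicated block. The class name, its methods, and the message counting are assumptions made for illustration; real protocols also track ownership, directory state, and acknowledgements.&lt;br /&gt;
&lt;br /&gt;
```python
class ReplicatedBlock:
    """Toy model of one shared block replicated on several nodes."""

    def __init__(self, nodes, value=0):
        # Every node starts with a valid copy of the block.
        self.copies = {node: value for node in nodes}

    def write_invalidate(self, writer, value):
        """Keep only the writer's copy; all other copies are dropped."""
        invalidated = [n for n in self.copies if n != writer]
        self.copies = {writer: value}
        return len(invalidated)  # invalidation messages sent

    def write_update(self, writer, value):
        """Push the new value to every other node that holds a copy."""
        self.copies[writer] = value
        others = [n for n in self.copies if n != writer]
        for node in others:
            self.copies[node] = value
        return len(others)  # update messages sent

    def read(self, reader, owner):
        """Read locally if a copy is valid, otherwise fetch from the owner."""
        missed = reader not in self.copies
        if missed:
            self.copies[reader] = self.copies[owner]
        return self.copies[reader], missed


nodes = ["A", "B", "C"]

inv = ReplicatedBlock(nodes)
inv_msgs = [inv.write_invalidate("A", v) for v in (7, 8)]
inv_read = inv.read("B", owner="A")

upd = ReplicatedBlock(nodes)
upd_msgs = [upd.write_update("A", v) for v in (7, 8)]
upd_read = upd.read("B", owner="A")

print("invalidate:", inv_msgs, inv_read)
print("update:    ", upd_msgs, upd_read)
```
&lt;br /&gt;
In this sketch, repeated writes by the same node cost nothing after the first invalidation, but the next remote read misses; with update, every write sends messages, but remote reads keep hitting locally.&lt;br /&gt;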
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
There are numerous studies of the performance of shared memory applications in distributed systems. The vast majority of them use a collection of programs named [http://dl.acm.org/citation.cfm?id=223990 SPLASH and SPLASH-2.]&lt;br /&gt;
===== SPLASH and SPLASH-2 =====&lt;br /&gt;
The '''Stanford ParalleL Applications for Shared memory''' (SPLASH) is a collection of parallel programs engineered for the evaluation of shared address space machines. These programs have been used in research studies to measure and analyze different aspects of the DSM architectures emerging at the time. A subsequent suite of programs (SPLASH-2) grew out of the need to address the limitations of the SPLASH programs. SPLASH-2 covers a broader domain of scientific programs, makes use of improved algorithms, and pays more attention to the architecture of the underlying systems.&lt;br /&gt;
&lt;br /&gt;
Selected applications in the SPLASH-2 collections include:&lt;br /&gt;
*FFT: a '''Fast Fourier Transform''' implementation, in which the data is organized in source and destination matrices so that each processor stores a contiguous set of rows in its local memory. In this application all processors communicate with one another, sending data to each other to perform a matrix transposition.&lt;br /&gt;
*Ocean: calculations of large-scale ocean movements, simulating eddy currents. Its calculations show nearest-neighbor access patterns on a multi-grid formation rather than on a single grid.&lt;br /&gt;
*LU: decomposition of a matrix into the product of a lower triangular and an upper triangular matrix. LU exhibits a &amp;quot;one-to-many non-personalized communication&amp;quot;.&lt;br /&gt;
*Barnes: simulates the interaction of a group of particles over time steps. &lt;br /&gt;
*Radix sort: integer sorting algorithm. This implementation involves communication among all the processors, and the communication follows irregular patterns.&lt;br /&gt;
&lt;br /&gt;
===== Case studies =====&lt;br /&gt;
In 2001, [http://escholarship.org/uc/item/76p9b40g#page-1 Shan et al.] presented a comparison of the performance and programming effort of MP versus SAS running on clusters of '''Symmetric Multiprocessors''' (SMPs). They highlighted the &amp;quot;automatic management and coherent replication&amp;quot; of the SAS programming model, which facilitates programming on these types of clusters. The study uses MPI/Pro for the MP programming model and the GeNIMA SVM protocol (a page-based shared virtual memory protocol) for SAS on a 32-processor system (a cluster of 8 machines, each a 4-way SMP). The subset of applications used includes regularly structured applications such as FFT, Ocean, and LU, in contrast with irregular ones such as Radix sort, among others.&lt;br /&gt;
&lt;br /&gt;
The complexity of development is represented by the number of lines of code per application, as shown in the table below. SAS complexity is lower than that of MP for every application, and the difference grows as applications become more irregular and dynamic in nature (the MP version is almost twice as long for Radix).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Appl.&lt;br /&gt;
!FFT &lt;br /&gt;
!OCEAN &lt;br /&gt;
!LU &lt;br /&gt;
!RADIX &lt;br /&gt;
!SAMPLE &lt;br /&gt;
!N-BODY&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MPI ||222||4320||470||384||479||1371&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SAS ||210 ||2878 ||309||201||450||950&lt;br /&gt;
|}&lt;br /&gt;
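&lt;br /&gt;
The size difference can be quantified directly from the line counts in the table above. The snippet below is only an illustrative calculation; the variable and function names are made up for this example.&lt;br /&gt;
&lt;br /&gt;
```python
# Lines of code per application, copied from the table above.
apps = ["FFT", "OCEAN", "LU", "RADIX", "SAMPLE", "N-BODY"]
mpi_lines = [222, 4320, 470, 384, 479, 1371]
sas_lines = [210, 2878, 309, 201, 450, 950]


def size_ratios(mp, sas):
    """Return MP/SAS line-count ratios, rounded to two decimals."""
    return [round(m / s, 2) for m, s in zip(mp, sas)]


ratios = size_ratios(mpi_lines, sas_lines)
for name, ratio in zip(apps, ratios):
    print(f"{name:7s} MPI/SAS = {ratio}")
```
&lt;br /&gt;
The ratio stays close to 1 for FFT and SAMPLE and is largest for RADIX, where the MPI version is roughly 1.9 times the length of the SAS version.&lt;br /&gt;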
&lt;br /&gt;
In terms of performance, the results indicated that SAS achieved only about half the parallel efficiency of MP for most of the applications in this study. The only application that showed similar performance for both models (MP and SAS) was LU. The authors concluded that the overhead of the SVM protocol for maintaining page coherence and synchronization was the handicap of the easier SAS programming model.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2004, [http://dl.acm.org/citation.cfm?id=1006252 Iosevich and Schuster] performed a study on the choice of memory consistency model in a DSM. The two types of models under study were the '''sequential consistency''' (SC) model and the relaxed consistency model, in particular the '''home-based lazy release consistency''' (HLRC) protocol. The SC provides a less complex programming model, whereas the HLRC improves on running performance as it allows parallel memory access. Memory consistency models provide a specific set of rules for interfacing with memory.&lt;br /&gt;
&lt;br /&gt;
The authors used a [http://www.cs.uga.edu/~dkl/6730/Fall02/Readings/millipage.pdf multiview] technique to ensure an efficient implementation of SC with fine-grain access to memory. The main advantage of this technique is that, by mapping one physical region to several virtual regions, the system avoids fetching the whole content of the physical region when there is a fault on a specific variable located in one of the virtual regions. One step further is the mechanism proposed by [http://onlinelibrary.wiley.com/doi/10.1002/spe.417/abstract Niv and Schuster] to change the granularity dynamically at runtime.&lt;br /&gt;
For SC with multiview (MV), only the page replicas need to be tracked, whereas for HLRC the tracking needed is much more complex. The advantage of HLRC is that write faults on read-only pages are handled locally, so these operations cost less.&lt;br /&gt;
&lt;br /&gt;
This table summarizes the characteristics of the benchmark applications used for the measurements. In the Synch column, B represents barriers and L represents locks.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Application&lt;br /&gt;
! Input data set&lt;br /&gt;
! Shared memory&lt;br /&gt;
!Sharing granularity&lt;br /&gt;
! Synch&lt;br /&gt;
!Allocation pattern&lt;br /&gt;
|-&lt;br /&gt;
| Water-nsq|| 8000 molecules|| 5.35MB|| a molecule (672B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Water-sp|| 8000 molecules|| 10.15MB|| a molecule (680B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| LU|| 3072 × 3072|| 72.10MB|| block (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| FFT|| 2^20 numbers|| 48.25MB|| a row segment|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| TSP|| A graph of 32 cities|| 27.86MB|| a tour (276B)|| L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| SOR|| 2066 × 10240|| 80.73MB|| a row (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Barnes-sp|| 32768 bodies|| 41.21MB|| body fields (4-32B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Radix|| 10240000 keys|| 82.73MB|| an integer (4B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Volrend|| a file -head.den-|| 29.34MB|| a 4 × 4 box (4B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Ocean|| a 514 × 514 grid|| 94.75MB|| grid point (8B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| NBody|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|-&lt;br /&gt;
| NBodyW|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The authors found that the average speedup of the HLRC protocol is 5.97, and the average speedup of the SC(with MV) protocol is 4.5 if non-optimized allocation is used, and 6.2 if the granularity is changed dynamically. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2008, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.1319 Roy and Chaudhary] compared the communication requirements of three different page-based DSM systems ([http://www.cs.umd.edu/projects/cvm/ CVM], [http://www.cs.utah.edu/flux/quarks.html Quarks], and [http://ieeexplore.ieee.org/iel4/5737/15339/00709960.pdf?arnumber=709960 Strings]) that use virtual memory mechanisms to trap accesses to shared areas. Their study was also based on the SPLASH-2 suite of programs.&lt;br /&gt;
In their experiments, Quarks was running tasks on separate processes (it does not support more than one application thread per process) while CVM and Strings were run with multiple application threads.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Program &lt;br /&gt;
!CVM &lt;br /&gt;
!Quarks &lt;br /&gt;
!Strings&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| FFT||1290||2419||1894&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-c||135||-||485&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-n||385||2873||407&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| OCEAN-c||1955||15475||6676&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-n2||2253||38438||10032&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-sp||905||7568||1998&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MATMULT||290||1307||645&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SOR||247||7236||934&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The results indicate that Quarks generates a large number of messages in comparison with the other two systems, as can be observed in the table above. LU-c (the contiguous version of LU) does not have a result for Quarks, as it was not possible to obtain results for the number of tasks (16) used in the experiment, due to the excessive number of collisions. In general, because of its lock-related traffic, the performance of Quarks is quite low compared to the other two systems for many of the application programs tested.&lt;br /&gt;
CVM improves on lock management by allowing out-of-order access through a centralized lock manager. This makes the locking times for CVM much smaller than for the other systems. &lt;br /&gt;
Nevertheless, the best performer is Strings. It beats the other two systems in overall performance, but there is room for improvement here too, as the locking times in Strings were observed to be a large percentage of the overall computation. &lt;br /&gt;
&lt;br /&gt;
The overall conclusion is that using multiple threads per node and optimized locking mechanisms (CVM-type) provides the best performance.&lt;br /&gt;
&lt;br /&gt;
=== Evolution ===&lt;br /&gt;
A more recent distributed shared memory system is vNUMA. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html Chapman and Heiser] describe vNUMA (where v is for virtual and NUMA is for [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non Uniform Memory Access]) as &amp;quot;a virtual machine that presents a cluster as a virtual shared-memory multiprocessor.&amp;quot; The virtualization in vNUMA is a layer between the real CPUs that form the distributed shared memory system and the OS that runs on top of it, usually Linux. The DSM in vNUMA is part of the hypervisor, which is the part of a virtual system that maps guest physical addresses to real physical ones. The guest OS, in turn, maps guest virtual addresses to guest physical addresses. &lt;br /&gt;
&lt;br /&gt;
The difference from other virtual machines is that vNUMA runs one OS on top of several physical machines, whereas virtual machine systems often run several guest OSes on top of one host OS on a single machine. vNUMA uses software DSM techniques to present all the virtualized memory as a whole.&lt;br /&gt;
&lt;br /&gt;
The DSM in vNUMA is a single-writer/multiple-reader write-invalidate protocol with sequential consistency. It is based on the IVY DSM, but it introduces several improvements to increase performance. The owner of a page can determine whether the page needs to be sent by looking at the copyset (the set of nodes that hold a copy of the page), which avoids several page faults, and the manager becomes the owner of the copyset as soon as it is part of it. There are two further improvements: ''incremental deterministic merging'' and ''write-update-plus (WU+)''. ''Incremental deterministic merging'' uses sequence numbers to ensure that a location is updated with the latest value and not with intermediate, out-of-order writes. ''Write-update-plus (WU+)'' enforces single-writer for pages on which atomic operations are performed. vNUMA dynamically changes from multiple-writer to single-writer when atomic operations are detected. &lt;br /&gt;
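&lt;br /&gt;
The role of sequence numbers in discarding out-of-order writes can be sketched as follows. This is only an illustration of the general idea described above, not vNUMA's actual merging algorithm; the class, its fields, and the assumption that sequence numbers start at 1 are all hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
class MergedLocation:
    """A shared location whose updates carry sequence numbers."""

    def __init__(self, value=0):
        self.value = value
        self.last_seq = 0  # sequence number of the update currently stored

    def apply(self, seq, value):
        """Store the update only if it is newer than the stored one."""
        if seq > self.last_seq:
            self.value = value
            self.last_seq = seq
            return True
        return False  # stale or duplicate update: ignored


loc = MergedLocation()
# Updates 1, 2, 3 were issued in order but arrive as 1, 3, 2.
arrivals = [(1, "a"), (3, "c"), (2, "b")]
results = [loc.apply(seq, val) for seq, val in arrivals]
print(results, loc.value)
```
&lt;br /&gt;
The late arrival of update 2 is rejected, so the location ends up holding the value of the latest write rather than an intermediate one.&lt;br /&gt;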
&lt;br /&gt;
In [http://communities.vmware.com/community/vmtn/cto/high-performance/blog/2011/09/19/vnuma-what-it-is-and-why-it-matters vNUMA what it is and why it matters], VMware presents vNUMA as part of vSphere, a virtualization platform oriented to build cloud computing frameworks.&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
*[http://www.sgi.com/pdfs/4250.pdf Performance and Productivity Breakthroughs with Very Large Coherent Shared Memory: The SGI® UV Architecture] SGI white paper.&lt;br /&gt;
*[http://www.scalemp.com/architecture#2 Versatile SMP (vSMP) Architecture]&lt;br /&gt;
*Adrian Moga, Michel Dubois, [http://www.sciencedirect.com/science/article/pii/S1383762108001136 &amp;quot;A comparative evaluation of hybrid distributed shared-memory systems,&amp;quot;] Journal of Systems Architecture, Volume 55, Issue 1, January 2009, Pages 43-52&lt;br /&gt;
*Jinbing Peng; Xiang Long; Limin Xiao; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=4708970&amp;amp;isnumber=4708921 &amp;quot;DVMM: A Distributed VMM for Supporting Single System Image on Clusters,&amp;quot;] Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for , vol., no., pp.183-188, 18-21 Nov. 2008&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
*Shan, H.; Singh, J.P.; Oliker, L.; Biswas, R.; , [http://escholarship.org/uc/item/76p9b40g#page-1 &amp;quot;Message passing vs. shared address space on a cluster of SMPs,&amp;quot;] Parallel and Distributed Processing Symposium., Proceedings 15th International , vol., no., pp.8 pp., Apr 2001&lt;br /&gt;
*Protic, J.; Tomasevic, M.; Milutinovic, V.; , [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 &amp;quot;Distributed shared memory: concepts and systems,&amp;quot;] Parallel &amp;amp; Distributed Technology: Systems &amp;amp; Applications, IEEE , vol.4, no.2, pp.63-71, Summer 1996&lt;br /&gt;
*Chandola, V. , [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.5206&amp;amp;rep=rep1&amp;amp;type=pdf &amp;quot;Design Issues in Implementation of Distributed Shared Memory in User Space,&amp;quot;]&lt;br /&gt;
*Nitzberg, B.; Lo, V. , [http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf &amp;quot;Distributed Shared Memory:  A Survey of Issues and Algorithms&amp;quot;]&lt;br /&gt;
*Steven Cameron Woo , Moriyoshi Ohara , Evan Torrie , Jaswinder Pal Singh , Anoop Gupta, [http://dl.acm.org/citation.cfm?id=223990 &amp;quot;The SPLASH-2 programs: characterization and methodological considerations,&amp;quot;] Proceedings of the 22nd annual international symposium on Computer architecture, p.24-36, June 22-24, 1995, S. Margherita Ligure, Italy&lt;br /&gt;
*Jegou, Y. , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Implementation of page management in Mome, a user-level DSM] Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on, p.479-486, 21 May 2003, IRISA/INRIA, France&lt;br /&gt;
*Hennessy, J.; Heinrich, M.; Gupta, A.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=747863 &amp;quot;Cache-Coherent Distributed Shared Memory:  Perspectives on Its Development and Future Challenges,&amp;quot;] Proceedings of the IEEE, Volume: 87 Issue:3, pp.418 - 429, Mar 1999, Comput. Syst. Lab., Stanford Univ., CA&lt;br /&gt;
*J. Protic, M. Tomasevic, and V. Milutinovic, [http://media.wiley.com/product_data/excerpt/76/08186773/0818677376-2.pdf An Overview of Distributed Shared Memory]&lt;br /&gt;
*Yoon, M.; Malek, M.; , [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf &amp;quot;Configurable Shared Virtual Memory for Parallel Computing&amp;quot;] University of Texas Technical Report tr94-21, July 15 1994, Department of Electrical and Computer Engineering, The University of Texas at Austin&lt;br /&gt;
*Dubnicki, C.;   Iftode, L.;   Felten, E.W.;   Kai Li;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=508084 &amp;quot;Software Support for Virtual Memory-Mapped Communication&amp;quot;] Parallel Processing Symposium, 1996., Proceedings of IPPS '96, The 10th International, pp.372 - 381, 15-19 Apr 1996, Dept. of Comput. Sci., Princeton Univ., NJ &lt;br /&gt;
*Dubnicki, C.;   Bilas, A.;   Li, K.;   Philbin, J.;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=580931 &amp;quot;Design and Implementation of Virtual Memory-Mapped Communication on Myrinet&amp;quot;] Parallel Processing Symposium, 1997. Proceedings., 11th International, pp.388 - 396, 1-5 Apr 1997, Princeton Univ., NJ&lt;br /&gt;
*Kranz, D.; Johnson, K.; Agarwal, A.; Kubiatowicz, J.; Lim, B.; , [http://delivery.acm.org.prox.lib.ncsu.edu/10.1145/160000/155338/p54-kranz.pdf?ip=152.1.24.251&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=81831968&amp;amp;CFTOKEN=62928147&amp;amp;__acm__=1327853249_029db7e958cb50bd47056f939f3296f7 &amp;quot;Integrating Message-Passing and Shared-Memory:  Early Experience&amp;quot;] PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp.54 - 63&lt;br /&gt;
*Amza, C.;  Cox, A.L.;  Dwarkadas, S.;  Keleher, P.;  Honghui Lu;  Rajamony, R.;  Weimin Yu;  Zwaenepoel, W.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 &amp;quot;TreadMarks: shared memory computing on networks of workstations&amp;quot;] Computer, Volume: 29 Issue:2, pp. 18 - 28, Feb 1996, Dept. of Comput. Sci., Rice Univ., Houston, TX&lt;br /&gt;
*David R. Cheriton. 1985. [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems.] SIGOPS Oper. Syst. Rev. 19, 4 (October 1985), 26-33.&lt;br /&gt;
*Matthew Chapman and Gernot Heiser. 2009. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html vNUMA: a virtual shared-memory multiprocessor.] In Proceedings of the 2009 conference on USENIX Annual technical conference (USENIX'09). USENIX Association, Berkeley, CA, USA, 2-2.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
The memory hierarchy described for the CSVS system places remote memories:&lt;br /&gt;
# Between main memory and local disk storage&lt;br /&gt;
# Same hierarchy as local disk storage&lt;br /&gt;
# Below local disk storage&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When messages are sent by the OS to retrieve non-local data the virtual address of the retrieved data is translated to physical:&lt;br /&gt;
# At the origin of the message, i.e. where the page fault occurs&lt;br /&gt;
# By the DSM system default manager&lt;br /&gt;
# At the location where the desired page resides&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
DSM nodes&lt;br /&gt;
# partially map variable amounts of their memory to the distributed address space&lt;br /&gt;
# are configured to supply a contiguous and fixed amount of memory to the distributed address space&lt;br /&gt;
# utilize I/O to access the entirely non-local distributed address space&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SAS programming model:&lt;br /&gt;
# Has evolved beyond MP as it is difficult to program in scalable DSM environments&lt;br /&gt;
# Utilizes MP to communicate but relies on the ease of a common address space&lt;br /&gt;
# Has suffered too many security problems, scalable MP now dominates the landscape&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Page management in MOME:&lt;br /&gt;
# Requires consistent address space mapping across all nodes&lt;br /&gt;
# Is managed from a global DSM perspective&lt;br /&gt;
# Allows an F and V page descriptor to occur for the same page on the same node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The most adopted DSM algorithm is:&lt;br /&gt;
# Single Reader/ Single Writer (SRSW)&lt;br /&gt;
# Multiple Readers/ Single Writer (MRSW)&lt;br /&gt;
# Multiple Readers/Multiple Writers (MRMW)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Sequential Consistency:&lt;br /&gt;
# all processors see the same ordering of memory references, and these are issued in sequences by the individual processors&lt;br /&gt;
# the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence&lt;br /&gt;
# consistency is required only on synchronization accesses&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
SPLASH is a&lt;br /&gt;
# coherence protocol&lt;br /&gt;
# collection of parallel programs engineered for the evaluation of shared address space machines&lt;br /&gt;
# DSM implementation&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The complexity of development is &lt;br /&gt;
# the same for MP and SAS&lt;br /&gt;
# lower for SAS&lt;br /&gt;
# lower for MP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
vNUMA is a&lt;br /&gt;
# Fast Fourier Transform implementation&lt;br /&gt;
# network implementation&lt;br /&gt;
# virtual machine that presents a cluster as a virtual shared-memory multiprocessor&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72703</id>
		<title>CSC/ECE 506 Spring 2013/4a ss</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72703"/>
		<updated>2013-02-13T23:01:19Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Dsm.jpg|300px|thumb|left|SCD's IBM SP system blackforest, a distributed shared memory ('''DSM''') system]]&lt;br /&gt;
&lt;br /&gt;
== SAS programming on distributed-memory machines ==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Shared_memory '''Shared Address Space'''] (SAS) programming on distributed memory machines is a programming abstraction that requires less development effort than the traditional method of [http://en.wikipedia.org/wiki/Message_passing '''Message Passing'''] (MP) on distributed memory machines. An example of a platform for SAS programming is a cluster of servers.  Distributed systems are groups of computers that communicate through a network and share a common work goal.  Distributed systems typically do not physically share the same memory (are not [http://en.wikipedia.org/wiki/Coupling_%28computer_programming%29 '''tightly coupled''']); instead, each processor or group of processors must depend on mechanisms other than direct memory access in order to communicate.  Some of the issues include [http://en.wikipedia.org/wiki/Memory_coherence '''memory coherence'''], types of memory access, data and process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''], and performance.&lt;br /&gt;
&lt;br /&gt;
=== Origins ===&lt;br /&gt;
Distributed memory systems started to flourish in the 1980s. The increasing performance of processors and network connectivity offered the perfect environment for parallel processing over a network of computers. This was a cheap way to put together massive computing power. The main drawback was the move from sequential programs written for local memory to parallel programs that share memory across machines. This was where SAS provided the means to simplify programming by hiding the mechanisms used to access distant memory located in other computers of the cluster.&lt;br /&gt;
&lt;br /&gt;
In 1985, Cheriton in his article [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 &amp;quot;Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems&amp;quot;] introduced ideas for the application of shared memory techniques in Distributed memory systems. Cheriton envisioned a system of nodes with a pool of shared memory with a common file namespace that could &amp;quot;decentralize the implementation of a service.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:Dsmd.jpg|400px|thumb|right|Distributed Shared Memory]]&lt;br /&gt;
Early distributed computer systems relied almost exclusively on message passing in order to communicate with one another, and this technique is still widely used today.  In a message passing model, each processor's local memory can be considered as isolated from that of the rest of the system.  Processes or objects can send or receive messages in order to communicate and this can occur in a synchronous or asynchronous manner.  In distributed systems, particularly with certain types of programs, the message passing model can become overly burdensome to the programmer as tracking data movement and maintaining data integrity can become quite challenging with many control threads.  A shared address or shared-memory system, however, can provide a programming model that simplifies data sharing via uniform mechanisms of data structure reads and writes on common memory.  Current distributed systems seek to take advantage of both SAS and MP programming model principles in hybrid systems. &lt;br /&gt;
&lt;br /&gt;
=== Distributed Shared Memory (DSM) ===&lt;br /&gt;
Distributed shared memory is an architectural approach designed to overcome the scaling limitations of symmetric shared memory multiprocessors while retaining a shared memory model for communication and programming. A distributed system consists of a set of nodes connected by an interconnection network.  Nodes may consist of individual processors or a multiprocessor system (e.g., a [http://en.wikipedia.org/wiki/Symmetric_multiprocessing '''Symmetric Multiprocessor'''] (SMP)), the latter typically sharing a system bus.  Each node contains a local memory, which maps partially to the distributed address space.  &lt;br /&gt;
Regardless of the system topology (bus, ring, mesh), a specific interconnection in each cluster must connect it to the system. Information about the states and current locations of particular data blocks usually resides in a system table or directory. Directory organization varies from full map storage to different dynamic organizations.&lt;br /&gt;
There are three issues in accessing data in the DSM address space while keeping the data consistent:&lt;br /&gt;
a) Which DSM algorithm to use to access data? Commonly used strategies are replication and migration. Replication allows multiple copies of the same data item to reside in different local memories. Migration implies that only a single copy of a data item exists at any one time, so the data item must be moved to the requesting site for exclusive use.&lt;br /&gt;
b) Implementation level of the DSM - A lookup should first determine whether the requested data is in local memory; if not, the system must bring the data to local memory. This can be done in software, in hardware, or in both. The choice of implementation depends on price/performance trade-offs.&lt;br /&gt;
c) Memory consistency model - The behavior of the memory with respect to read and write operations from multiple processors must be governed by an appropriate memory consistency model.&lt;br /&gt;
&lt;br /&gt;
=== Cache-Coherent DSM ===&lt;br /&gt;
&lt;br /&gt;
Cache coherence problems arise when different processors cache and update values of the same memory location. Early DSM systems implemented a shared address space where the amount of time required to access a piece of data was &lt;br /&gt;
related to its location.  These systems became known as [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access '''Non-Uniform Memory Access'''] (NUMA) architectures, whereas an SMP type&lt;br /&gt;
system is known as a [http://en.wikipedia.org/wiki/Uniform_Memory_Access '''Uniform Memory Access'''] (UMA) architecture.  NUMA architectures were difficult to program due &lt;br /&gt;
to potentially significant differences in data access times. SMP architectures dealt with this problem through caching. &lt;br /&gt;
Protocols were established to ensure that, prior to writing a location, all other copies of the location (in other caches)&lt;br /&gt;
were invalidated. Thus, the system allows multiple copies of a memory location to exist when&lt;br /&gt;
it is being read, but only one copy when it is being written. &lt;br /&gt;
&lt;br /&gt;
Cache-coherent DSM architectures rely on directory-based [http://en.wikipedia.org/wiki/Cache_coherency '''cache coherence'''], in which an extra directory structure keeps track&lt;br /&gt;
of all blocks that have been cached by each processor. Since the directory tracks which caches have copies&lt;br /&gt;
of any given memory block, a coherence protocol can use&lt;br /&gt;
the directory, together with state and other information about each cached block, to maintain a consistent view of &lt;br /&gt;
memory. A simple cache coherence&lt;br /&gt;
protocol can operate with three states for each cache block. These states are Invalid,&lt;br /&gt;
Shared, and Exclusive.  Furthermore, in a cache-coherent DSM machine, the directory is distributed in memory so that each entry is associated&lt;br /&gt;
with the cache block it describes in the physical local memory.&lt;br /&gt;
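&lt;br /&gt;
The three-state protocol described above can be illustrated with a minimal sketch.  The ''Directory'' class and its method names are hypothetical and model a single memory block only; real protocols also handle write-backs, races, and distributed directory entries.&lt;br /&gt;
&lt;br /&gt;
```python
from enum import Enum


class State(Enum):
    INVALID = 'I'
    SHARED = 'S'
    EXCLUSIVE = 'E'


class Directory:
    # Toy directory entry for one memory block: the state held by each node.

    def __init__(self, num_nodes):
        self.state = [State.INVALID] * num_nodes

    def read(self, node):
        # A read downgrades any other exclusive owner to SHARED ...
        for other, st in enumerate(self.state):
            if st is State.EXCLUSIVE and other != node:
                self.state[other] = State.SHARED
        # ... and gives the reader a shared copy if it had none.
        if self.state[node] is State.INVALID:
            self.state[node] = State.SHARED

    def write(self, node):
        # A write invalidates every other copy before granting exclusivity.
        for other in range(len(self.state)):
            if other != node:
                self.state[other] = State.INVALID
        self.state[node] = State.EXCLUSIVE

    def sharers(self):
        # Nodes that currently hold a valid copy of the block.
        return [n for n, st in enumerate(self.state) if st is not State.INVALID]


directory = Directory(num_nodes=4)
directory.read(0)
directory.read(1)
print(directory.sharers())    # several read-only copies may coexist
directory.write(2)
print(directory.sharers())    # after a write only the writer holds a copy
```
&lt;br /&gt;
The sketch keeps the invariant stated earlier: many copies may exist while a block is only read, but a single copy remains once it is written.&lt;br /&gt;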
&lt;br /&gt;
=== Page Management and memory mapping in Mome ===&lt;br /&gt;
[[File:Untitled_Project.jpg|350px|thumb|left|Memory Mapping in Mome]]&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Mome] is described by its developers as a user-level distributed shared memory.  Mome, in 2003, was a run-time model that mapped Mome segments onto node private address space.  &lt;br /&gt;
&lt;br /&gt;
===== Mome Segment creation =====&lt;br /&gt;
&lt;br /&gt;
Segment creation was initiated through a ''MomeCreateSegment(size)'' call, which returned an identifier for mapping used by all nodes.  Any process can request a mapping of a section of its local memory to a Mome segment section by calling ''MomeMap(Addr, Lg, Prot, Flags, Seg, Offset)'', which returns the starting address of the mapped region.  Each mapping request made by a process is independent, and the addresses of the mappings may or may not be consistent on all nodes.  If mappings are consistent between processes, however, then pointers may be shared by them.  Mome supports strong and weak consistency models, and for any particular page each node is able to dynamically manage its consistency during program execution.&lt;br /&gt;
&lt;br /&gt;
===== Page Management in Mome =====&lt;br /&gt;
&lt;br /&gt;
Mome manages [http://en.wikipedia.org/wiki/Page_%28computer_memory%29 '''pages'''] in a directory-based scheme where each page directory maintains the status of six characteristics per page on each node.  The page manager acts upon collections of nodes according to these characteristics for each page:  &lt;br /&gt;
V nodes possess the current version, M nodes have a modified version, S nodes want strong consistency, I nodes are &lt;br /&gt;
invalidated, F nodes have initiated a modification merge, and H nodes hold a special type of hidden page.  A new version of &lt;br /&gt;
a page is created prior to a constraint violation and before modifications are integrated as a result of a consistency&lt;br /&gt;
request.  &lt;br /&gt;
&lt;br /&gt;
===== Memory mapping in Mome =====&lt;br /&gt;
&lt;br /&gt;
The Mome memory mapping figure to the left shows a possible DSM memory organization on a single node.  The DSM memory size&lt;br /&gt;
shown is 22 pages.  When a new segment is created on a node a segment descriptor is created on that node.  In this case the&lt;br /&gt;
segment descriptor is 12 pages, with each segment descriptor block corresponding to one page.  Each block also contains&lt;br /&gt;
three DSM memory references for current, modified and next version of pages.  The memory organization state shows an &lt;br /&gt;
application with two mappings, M1 and M2, with segment offsets at 0 and 8.  The six pages of M1 are managed by segment &lt;br /&gt;
descriptor blocks 0 to 5.  The descriptor blocks (and application memory) show that pages 1,2 and 5 have no associated &lt;br /&gt;
memory, while M1 page 0 is mapped to block 6 as a current version and M1 page 3 is mapped to block 13 as a current version,&lt;br /&gt;
block 8 as a modified version, and has initiated a modifications merge as indicated by the block 17 pointer.  The communication&lt;br /&gt;
layer manages incoming messages from other nodes. &lt;br /&gt;
&lt;br /&gt;
[[File:Mem hierarchy.png|200px|thumb|right|Memory hierarchy of node]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Node Communication ===&lt;br /&gt;
As described by [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf Yoon, et al.] in 1994, the communication paradigm &lt;br /&gt;
for their DSM node network relies on a memory hierarchy for each node that places remote memories at the same level of the hierarchy as &lt;br /&gt;
its own local disk storage.  [http://en.wikipedia.org/wiki/Page_fault '''Page faults'''] within a given node that can be resolved within disk storage are handled &lt;br /&gt;
normally, while those that cannot are resolved between node main memory and the memory of other nodes. Point-to-point &lt;br /&gt;
communication at the node level is supported through message passing, and the specific mechanism for communication is&lt;br /&gt;
agreed to by all nodes. &lt;br /&gt;
&lt;br /&gt;
Yoon describes a DSM system that generates a shared virtual memory on a per-job basis.  A '''configurable shared virtual address space'''&lt;br /&gt;
(CSVS) is prepared when a member node receives a job, generates a job identification number, and creates&lt;br /&gt;
an information table in its memory:&lt;br /&gt;
&lt;br /&gt;
                                     ''JOB_INFORMATION {''&lt;br /&gt;
                                         ''status;''&lt;br /&gt;
                                         ''number_of_tasks;''&lt;br /&gt;
                                         ''number_of_completed_tasks;''&lt;br /&gt;
                                         ''*member_list;''                    /*pointer to first member*/&lt;br /&gt;
                                         ''number_of_members;''&lt;br /&gt;
                                         ''IO_server;''&lt;br /&gt;
                                     ''}''&lt;br /&gt;
&lt;br /&gt;
The ''status'' refers to the creation of the CSVS, and ''number_of_members'' and ''member_list'' are established through&lt;br /&gt;
a task distribution process during address space assignment.  All tasks associated with the program are tagged with&lt;br /&gt;
the ''job_id'' and ''requester_id'' and, following address space assignment, are distributed across the system.  The&lt;br /&gt;
actual CSVS creation occurs when the first task of a job is initiated by a member, which requests that all other members generate the new&lt;br /&gt;
CSVS.  Subspace assignment for the SAS model ensues under the specific ''job_id''.&lt;br /&gt;
&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Operating_system '''operating system'''] (OS) or [http://en.wikipedia.org/wiki/Memory_management_unit '''memory management unit'''] (MMU) of each member maintains a copy of the ''JOB_INFORMATION'' &lt;br /&gt;
table which is consulted to identify the default manager when a page fault occurs.  When a page fault does occur, the MMU&lt;br /&gt;
locates the default manager and handles the fault normally.  If the page requested is out of its subspace then the &lt;br /&gt;
virtual address, ''job_id'', and default manager identification are sent to the [http://en.wikipedia.org/wiki/Control_unit '''control unit'''] (CU) to construct a &lt;br /&gt;
message requesting a page copy.  All messages sent through the CSVS must include a virtual address and the ''job_id'',&lt;br /&gt;
which acts as protection to control access to relevant memory locations.  When received at the appropriate member&lt;br /&gt;
node, the virtual address is translated to a local physical address.&lt;br /&gt;
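&lt;br /&gt;
The page-fault path described above can be sketched as follows.  This is an illustration under stated assumptions rather than the scheme from the report: the equal-sized subspaces, the ''SUBSPACE_SIZE'' constant, and the ''default_manager'' and ''handle_page_fault'' helpers are all hypothetical, and only the ''JOB_INFORMATION'' field names come from the table above.&lt;br /&gt;
&lt;br /&gt;
```python
from dataclasses import dataclass, field


@dataclass
class JobInformation:
    # Field names follow the JOB_INFORMATION table; the types are assumptions.
    status: str
    number_of_tasks: int
    number_of_completed_tasks: int = 0
    member_list: list = field(default_factory=list)
    io_server: str = None

    @property
    def number_of_members(self):
        return len(self.member_list)


SUBSPACE_SIZE = 1024 * 1024    # assumed size of one member subspace, in bytes


def default_manager(job, virtual_address):
    # Assumed layout: subspace i of the CSVS is managed by member_list[i].
    index = (virtual_address // SUBSPACE_SIZE) % job.number_of_members
    return job.member_list[index]


def handle_page_fault(job, job_id, local_node, virtual_address):
    manager = default_manager(job, virtual_address)
    if manager == local_node:
        # The page lies in the local subspace: resolve the fault normally.
        return ('local', virtual_address)
    # Otherwise build a page-copy request carrying the virtual address and job_id.
    request = {'to': manager, 'job_id': job_id, 'virtual_address': virtual_address}
    return ('request_page_copy', request)


job = JobInformation(status='created', number_of_tasks=4,
                     member_list=['node0', 'node1', 'node2'])
print(handle_page_fault(job, 7, 'node0', 4096))
print(handle_page_fault(job, 7, 'node0', SUBSPACE_SIZE + 4096))
```
&lt;br /&gt;
Carrying the ''job_id'' in every request is what lets the receiving member check that the access belongs to the job that owns the address space.&lt;br /&gt;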
[[File:Jacobi_code.jpg|300px|thumb|left|Jacobi method pseudocode using TreadMarks API]]&lt;br /&gt;
&lt;br /&gt;
=====Improvements in communication=====&lt;br /&gt;
Early SAS programming models in DSM environments suffered from poor performance because protection schemes required&lt;br /&gt;
applications to access the network via system calls, significantly increasing latency.  Later software&lt;br /&gt;
systems and network interfaces arose that were able to ensure safety without incurring the time cost of the system calls.  Addressing this and other &lt;br /&gt;
latency sources on both ends of communication was an important goal for projects such as the '''Virtual memory-mapped communication''' (VMMC) model&lt;br /&gt;
that was developed as part of the [http://shrimp.cs.princeton.edu/index.html Shrimp Project]. &lt;br /&gt;
&lt;br /&gt;
Protection is achieved in VMMC because the receiver must grant permission before the sender is allowed to transfer data&lt;br /&gt;
to a receiver-defined area of its address space.  In this communication scheme, the receiver process exports areas of its&lt;br /&gt;
address space that will act as receive buffers, and sending processes must import the destinations.  There is no explicit &lt;br /&gt;
receive operation in VMMC.  Receivers are able&lt;br /&gt;
to define which senders can import specific buffers, and VMMC ensures that only receive buffer space is overwritten.  Imported&lt;br /&gt;
receive buffers are mapped to a destination proxy space, which can be implemented as part of the sender's virtual address&lt;br /&gt;
space and can be translated by VMMC to a receiver node, process, and memory address.  VMMC supports a deliberate update&lt;br /&gt;
request, which explicitly transfers data from the sender's memory to a previously imported receive buffer.  This transfer occurs directly without receiver&lt;br /&gt;
CPU interruption.&lt;br /&gt;
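&lt;br /&gt;
The export/import handshake can be modelled with a small sketch.  The ''Receiver'' and ''Sender'' classes and their methods are hypothetical names chosen for illustration; they are not the VMMC API, and the sketch runs in a single process instead of across a network.&lt;br /&gt;
&lt;br /&gt;
```python
class Receiver:
    def __init__(self, name, memory_size):
        self.name = name
        self.memory = bytearray(memory_size)
        self.exports = {}    # buffer id -> (offset, length, allowed sender names)

    def export(self, buffer_id, offset, length, allowed_senders):
        # The receiver decides which region is writable and by whom.
        self.exports[buffer_id] = (offset, length, set(allowed_senders))


class Sender:
    def __init__(self, name):
        self.name = name
        self.proxies = {}    # proxy id -> (receiver, buffer id)

    def import_buffer(self, proxy_id, receiver, buffer_id):
        _, _, allowed = receiver.exports[buffer_id]
        if self.name not in allowed:
            raise PermissionError(self.name + ' may not import ' + buffer_id)
        self.proxies[proxy_id] = (receiver, buffer_id)

    def send(self, proxy_id, data, at=0):
        receiver, buffer_id = self.proxies[proxy_id]
        offset, length, _ = receiver.exports[buffer_id]
        # Only the exported receive buffer may be overwritten.
        if 0 > at or at + len(data) > length:
            raise ValueError('transfer would fall outside the receive buffer')
        start = offset + at
        # Data lands directly in receiver memory; no receive call is involved.
        receiver.memory[start:start + len(data)] = data


rx = Receiver('node_b', memory_size=64)
rx.export('buf0', offset=16, length=8, allowed_senders=['node_a'])
tx = Sender('node_a')
tx.import_buffer('proxy0', rx, 'buf0')
tx.send('proxy0', b'hello')
print(bytes(rx.memory[16:21]))
```
&lt;br /&gt;
A sender that was not named in the export fails at import time, and a transfer longer than the exported length is rejected, which mirrors the two protection guarantees described above.&lt;br /&gt;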
&lt;br /&gt;
[[File:Shortest_path_pseudocode.jpg|300px|thumb|right|Shortest path pseudocode using TreadMarks API]]&lt;br /&gt;
=== Programming Environment ===&lt;br /&gt;
The globally shared memory abstraction provided through virtual memory or some other DSM mechanism allows programmers &lt;br /&gt;
to focus on algorithms instead of processor communication and data tracking.  Many programming environments have been&lt;br /&gt;
developed for DSM systems including Rice University's [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]&lt;br /&gt;
in the 1990s.  TreadMarks was a user-level library that ran on top of Unix.  Programs were written in&lt;br /&gt;
C, C++ or Fortran and then compiled and linked with the TreadMarks library.  &lt;br /&gt;
&lt;br /&gt;
Shown at left is a pseudocode example that uses the TreadMarks API to implement the Jacobi method, an iterative technique used to solve partial &lt;br /&gt;
differential equations.  The code iterates over a 2D array and updates each element to the average of its four&lt;br /&gt;
nearest neighbors.  All processors are assigned an approximately equal number of rows, and neighboring processes &lt;br /&gt;
share boundary rows as necessary for the calculation.  This example shows TreadMarks' use of [http://en.wikipedia.org/wiki/Barrier_%28computer_science%29 '''barriers'''], a technique used for process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''].  Barriers prevent race&lt;br /&gt;
conditions.  ''void Tmk_startup(int argc, char **argv)'' initializes TreadMarks and starts the remote processes.  &lt;br /&gt;
The ''void Tmk_barrier(unsigned id)'' call blocks the calling process until every other process arrives at the barrier.  In this&lt;br /&gt;
example, ''Tmk_barrier(0)'' guarantees that process 0 completes initialization before any process proceeds, ''Tmk_barrier(1)'' &lt;br /&gt;
guarantees that all previous iteration values are read before any current iteration values are written, and ''Tmk_barrier(2)''&lt;br /&gt;
guarantees that all current iteration values are written before any next iteration computation begins.&lt;br /&gt;
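&lt;br /&gt;
The same read-then-write barrier pattern can be sketched with Python threads standing in for TreadMarks processes.  This is not TreadMarks code: ''threading.Barrier'' replaces ''Tmk_barrier'', the grid sizes and the fixed boundary value of 100.0 are arbitrary assumptions, and the interior rows are assumed to divide evenly among the workers.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


def jacobi_parallel(rows=10, cols=8, workers=4, iterations=50):
    # Shared grid: the top boundary row is held at 100.0, everything else starts at 0.0.
    grid = [[100.0 if r == 0 else 0.0 for _ in range(cols)] for r in range(rows)]
    scratch = [row[:] for row in grid]
    barrier = threading.Barrier(workers)
    per_worker = (rows - 2) // workers    # assumes the interior rows divide evenly

    def worker(rank):
        first = 1 + rank * per_worker
        my_rows = range(first, first + per_worker)
        for _ in range(iterations):
            for r in my_rows:
                for c in range(1, cols - 1):
                    scratch[r][c] = 0.25 * (grid[r - 1][c] + grid[r + 1][c]
                                            + grid[r][c - 1] + grid[r][c + 1])
            barrier.wait()    # every worker has finished reading the old values
            for r in my_rows:
                grid[r][1:cols - 1] = scratch[r][1:cols - 1]
            barrier.wait()    # every worker has written before the next iteration reads

    threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return grid


result = jacobi_parallel()
print(round(result[1][3], 3))
```
&lt;br /&gt;
The two ''barrier.wait()'' calls play the roles of ''Tmk_barrier(1)'' and ''Tmk_barrier(2)''; without the first, a worker could overwrite a boundary row that its neighbor has not yet read.&lt;br /&gt;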
&lt;br /&gt;
To the right is a short pseudocode program exemplifying another SAS synchronization technique, which uses [http://en.wikipedia.org/wiki/Lock_%28computer_science%29 '''locks'''].  This program calculates the shortest route through a group of nodes that starts at a designated start node, visits every&lt;br /&gt;
other node once, and returns to the origin node (the traveling salesman problem).  The length of the shortest route identified thus far is stored in the shared ''Shortest_length'',&lt;br /&gt;
and partial routes under investigation are kept in a queue, most promising at the front, and expanded one node at a time.  A process&lt;br /&gt;
compares the length of its resulting route with ''Shortest_length'', updates it if necessary, and returns to the queue&lt;br /&gt;
to continue its search.  Process 0 allocates the shared queue and minimum length.  Exclusive access must be established&lt;br /&gt;
and maintained to ensure correctness, and this is achieved through locks on the queue and on ''Shortest_length''.  Each&lt;br /&gt;
process acquires the queue lock to identify a promising partial path and releases it upon finding one.  When &lt;br /&gt;
updating ''Shortest_length'', a lock is acquired to ensure [http://en.wikipedia.org/wiki/Mutual_exclusion '''mutual exclusion'''] on this shared data as well.&lt;br /&gt;
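&lt;br /&gt;
A lock-protected search of this kind can be sketched with Python threads.  This is a simplified illustration and not the TreadMarks program: the function name ''shortest_tour'', the distance matrix, and the termination rule (a worker exits as soon as it finds the queue empty) are assumptions made to keep the example short.&lt;br /&gt;
&lt;br /&gt;
```python
import heapq
import threading


def shortest_tour(dist, workers=3):
    n = len(dist)
    queue = [(0, [0])]    # shared priority queue of (length, partial path)
    best = {'length': float('inf'), 'tour': None}    # shared minimum
    queue_lock = threading.Lock()
    best_lock = threading.Lock()

    def worker():
        while True:
            with queue_lock:    # exclusive access to the shared queue
                if not queue:
                    return      # simplification: an idle worker simply exits
                length, path = heapq.heappop(queue)
            with best_lock:
                bound = best['length']
            if length >= bound:
                continue        # cannot beat the best complete tour found so far
            last = path[-1]
            if len(path) == n:
                total = length + dist[last][0]    # close the tour at the origin
                with best_lock:    # mutual exclusion on the shared minimum
                    if best['length'] > total:
                        best['length'], best['tour'] = total, path + [0]
                continue
            children = [(length + dist[last][city], path + [city])
                        for city in range(n) if city not in path]
            with queue_lock:
                for child in children:
                    heapq.heappush(queue, child)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return best['length'], best['tour']


DIST = [
    [0, 2, 9, 10],
    [1, 0, 6, 4],
    [15, 7, 0, 8],
    [6, 3, 12, 0],
]
length, tour = shortest_tour(DIST)
print(length, tour)
```
&lt;br /&gt;
Each lock is held only while the shared structure is touched, so the expansion of a partial path proceeds outside any critical section.  A worker that pushes new partial paths always returns to the queue afterwards, so the early exit of idle workers reduces parallelism but does not leave work undone.&lt;br /&gt;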
&lt;br /&gt;
=== DSM Implementations ===&lt;br /&gt;
From an architectural point of view, DSMs are composed of several nodes connected via a network. Each of the nodes can be an individual machine or a cluster of machines. Each system has local memory modules that are partly or entirely part of the shared memory. There are many characteristics that can be used to classify DSM implementations. One obvious criterion is the level at which the DSM is implemented: software, hardware, or hybrid. This historical classification has been extracted from [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 Distributed shared memory: concepts and systems].&lt;br /&gt;
&lt;br /&gt;
Software DSM implementations are those implemented using user-level software, the OS, a programming language, or a combination of some or all of them. &lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Implementation'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Implementation / Cluster configuration'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Network'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Algorithm'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Consistency Model'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Granularity Unit'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Coherence Policy'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''SW/HW/Hybrid'''&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/vm-and-gc/ivy-shared-virtual-memory-li-icpp-1988.pdf IVY]||User-level library + OS modification || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||1 Kbyte ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/121133.121159 Munin]||Runtime system + linker + library + preprocessor + OS modifications ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||Type-specific (SRSW, MRSW, MRMW) ||Release ||Variable size objects ||Type-specific (delayed update, invalidate) ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]||User-level || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRMW ||Lazy release ||4 Kbytes ||Update, Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/74851.74871 Mirage]||OS kernel ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||MRSW ||Sequential ||512 bytes ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://onlinelibrary.wiley.com.prox.lib.ncsu.edu/doi/10.1002/spe.4380210503/pdf Clouds]||OS, out of kernel || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Inconsistent, sequential ||8 Kbytes ||Discard segment when unlocked ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1663305&amp;amp;tag=1 Linda]||Language || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||Variable (tuple size) ||Implementation-dependent ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=93183 Memnet]||Single processor, Memnet device||Token ring||MRSW||Sequential||32 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Scalable_Coherent_Interface SCI]||Arbitrary||Arbitrary||MRSW||Sequential||16 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.8026&amp;amp;rep=rep1&amp;amp;type=pdf KSR1]||64-bit custom PE, I+D caches, 32M local memory||Ring-based||MRSW||Sequential||128 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=766965&amp;amp;isnumber=16621 RMS]||1-4 processors, caches, 256M local memory||RM bus||MRMW||Processor||4 bytes||Update||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Alewife_(multiprocessor) Alewife]||Sparcle PE, 64K cache, 4M local memory, CMMU|| mesh||MRSW||Sequential||16 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://dl.acm.org/citation.cfm?id=192056 Flash]||MIPS T5, I +D  caches, Magic controller|| mesh||MRSW||Release||128 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s08/papers/10_tempest.pdf Typhoon]||SuperSparc, 2-L caches|| NP controller||MRSW||Custom||32 bytes||Invalidate custom||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://shrimp.cs.princeton.edu/ Shrimp]||16 Pentium PC nodes|| Intel Paragon routing network||MRMW||AURC, scope||4 Kbytes||Update/Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Below is an explanation of the main characteristics listed in the DSM classification.&lt;br /&gt;
 &lt;br /&gt;
There are three types of DSM algorithm: &lt;br /&gt;
* '''Single Reader/ Single Writer''' (SRSW) &lt;br /&gt;
** central server algorithm - produces long network delays &lt;br /&gt;
** migration algorithm - produces thrashing and false sharing&lt;br /&gt;
* '''Multiple Readers/ Single Writer''' (MRSW) - read replication algorithm. It uses write invalidate. MRSW is the most widely adopted algorithm.&lt;br /&gt;
* '''Multiple Readers/Multiple Writers''' (MRMW) - full replication algorithm.  It has full concurrency and uses atomic updates.&lt;br /&gt;
&lt;br /&gt;
The consistency model plays a fundamental role in DSM systems. Because of the distributed nature of these systems, each consistency model constrains memory accesses in a different way. &amp;quot;A memory consistency model defines the legal ordering of memory references issued by some processor, as observed by other processors.&amp;quot; The stricter the consistency model, the higher the access times, but the simpler the programming. Some of the consistency model types are:&lt;br /&gt;
* Sequential consistency - all processors see the same ordering of memory references, and the references of each individual processor appear in the order in which that processor issued them.&lt;br /&gt;
* Processor consistency - writes issued by any single processor are observed by every other processor in the order they were issued, but writes from different processors need not be observed in the same order everywhere. &lt;br /&gt;
* Weak consistency - consistency is required only on synchronization accesses.&lt;br /&gt;
* Release consistency - divides the synchronization accesses into acquire and release. Normal reads and writes for a particular node can be done only after all acquires for that node are finished. Similarly, releases can only be done after all writes and reads are finished. &lt;br /&gt;
* Lazy Release consistency - extends the Release consistency, by propagating modifications to the shared data only on the acquire, and of those, only the ones related to critical sections.&lt;br /&gt;
* Entry consistency - synchronization is performed at variable level. This increases programming labor but helps with lowering latency and traffic exchanges as only the specific variable needs to be synchronized.&lt;br /&gt;
&lt;br /&gt;
Granularity refers to the size of the data blocks that are managed by the coherence protocols. The unit differs between hardware and software systems: hardware systems tend to use smaller blocks than software systems, which typically manage data at the level of virtual memory pages. The problem with larger blocks is that the probability of contention is higher, even when the processors involved are not accessing exactly the same piece of memory, just different parts of the same block. This is known as false sharing, and it creates thrashing (memory blocks keep being requested by processors, and processors keep waiting for the same memory blocks).&lt;br /&gt;
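&lt;br /&gt;
The dependence of false sharing on block size reduces to integer division of addresses, as the following sketch shows.  The two byte addresses are made-up values chosen for illustration, and the helper names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def block_of(address, block_size):
    # Index of the coherence block that contains a byte address.
    return address // block_size


def falsely_shared(addr_a, addr_b, block_size):
    # Two distinct addresses in the same block are candidates for false sharing.
    same_block = block_of(addr_a, block_size) == block_of(addr_b, block_size)
    return addr_a != addr_b and same_block


# Two variables 1000 bytes apart, each written by a different processor.
addr_x, addr_y = 8192, 9192
for size in (32, 128, 1024, 4096):
    print(size, falsely_shared(addr_x, addr_y, size))
```
&lt;br /&gt;
With these assumed addresses the two variables share a block at the 1 Kbyte and 4 Kbyte granularities listed for IVY and TreadMarks in the table above, but not at the 32 byte or 128 byte granularities of Memnet and KSR1.&lt;br /&gt;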
&lt;br /&gt;
Coherence policy regulates data replication. It dictates whether the data being written at one site should be invalidated or updated at the remote sites. Usually, systems with fine-grain coherence (byte/word) impose the update policy, whereas systems based on coarse-grain (page) coherence use the invalidate policy. Elsewhere in the literature this is also known as the coherence protocol, and the two types of protocols are known as write-invalidate and write-update. The write-invalidate protocol invalidates all copies except the one being written before the write takes place. In contrast, write-update keeps all copies up to date.&lt;br /&gt;
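&lt;br /&gt;
The difference between the two policies can be seen in a toy model in which each node's replica is a dictionary entry and ''None'' marks an invalid copy.  The ''write'' function and the node names are hypothetical and ignore message ordering and acknowledgements.&lt;br /&gt;
&lt;br /&gt;
```python
def write(copies, writer, value, policy):
    # copies maps node name -> cached value, or None when the copy is invalid.
    for node in copies:
        if node == writer:
            copies[node] = value
        elif policy == 'update' and copies[node] is not None:
            copies[node] = value    # write-update: refresh every valid replica
        elif policy == 'invalidate':
            copies[node] = None     # write-invalidate: discard remote replicas
    return copies


print(write({'A': 1, 'B': 1, 'C': 1}, 'A', 2, 'invalidate'))
print(write({'A': 1, 'B': 1, 'C': 1}, 'A', 2, 'update'))
```
&lt;br /&gt;
Under the invalidate policy the remote nodes must fetch the block again on their next access, while the update policy pays for a data transfer to every replica on every write.&lt;br /&gt;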
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
There are numerous studies of the performance of shared memory applications in distributed systems. The vast majority of them use a collection of programs named [http://dl.acm.org/citation.cfm?id=223990 SPLASH and SPLASH-2.]&lt;br /&gt;
===== SPLASH and SPLASH-2 =====&lt;br /&gt;
The '''Stanford ParalleL Applications for Shared memory''' (SPLASH) suite is a collection of parallel programs engineered for the evaluation of shared address space machines. These programs have been used in research studies to provide measurements and analysis of different aspects of the DSM architectures emerging at the time. A subsequent suite of programs (SPLASH-2) evolved from the need to address the limitations of the SPLASH programs. SPLASH-2 covers a broader domain of scientific programs, makes use of improved algorithms, and pays more attention to the architecture of the underlying systems.&lt;br /&gt;
&lt;br /&gt;
Selected applications in the SPLASH-2 collections include:&lt;br /&gt;
*FFT: a '''Fast Fourier Transform''' implementation in which the data is organized in source and destination matrices so that each processor stores a contiguous set of rows in its local memory. In this application all processors communicate with one another, sending data to each other to perform a matrix transposition.&lt;br /&gt;
*Ocean: computes large-scale ocean movements by simulating eddy currents. It exhibits nearest-neighbor access patterns on a multigrid rather than on a single grid.&lt;br /&gt;
*LU: decomposition of a matrix into the product of a lower triangular and an upper triangular matrix. LU exhibits a &amp;quot;one-to-many non-personalized communication&amp;quot;.&lt;br /&gt;
*Barnes: simulates the interaction of a group of particles over a number of time steps. &lt;br /&gt;
*Radix sort: an integer sorting algorithm. This implementation is an example of communication among all the processors involved, with irregular communication patterns.&lt;br /&gt;
&lt;br /&gt;
===== Case studies =====&lt;br /&gt;
In 2001, [http://escholarship.org/uc/item/76p9b40g#page-1 Shan et al.] presented a comparison of the performance and programming effort of MP versus SAS running on clusters of '''Symmetric Multiprocessors''' (SMPs). They highlighted the &amp;quot;automatic management and coherent replication&amp;quot; of the SAS programming model, which facilitates programming on these types of clusters. The study uses MPI/Pro for the MP programming model and the GeNIMA SVM protocol (a page-based shared virtual memory protocol) for SAS on a 32-processor system (a cluster of 8 machines, each a 4-way SMP). The subset of applications used contrasts regularly structured applications such as FFT, Ocean, and LU with irregular ones such as Radix sort, among others.&lt;br /&gt;
&lt;br /&gt;
The complexity of development is represented by the number of lines of code per application, as shown in the table below. SAS complexity is significantly lower than that of MP, and the difference increases as applications become more irregular and dynamic in nature (the line count almost doubles for Radix).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Appl.&lt;br /&gt;
!FFT &lt;br /&gt;
!OCEAN &lt;br /&gt;
!LU &lt;br /&gt;
!RADIX &lt;br /&gt;
!SAMPLE &lt;br /&gt;
!N-BODY&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MPI ||222||4320||470||384||479||1371&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SAS ||210 ||2878 ||309||201||450||950&lt;br /&gt;
|}&lt;br /&gt;
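&lt;br /&gt;
The relative code size can be computed directly from the table.  The sketch below only restates the tabulated line counts and derives the MPI-to-SAS ratio per application; the variable names are arbitrary.&lt;br /&gt;
&lt;br /&gt;
```python
# Lines of code per application, copied from the table above.
mpi = {'FFT': 222, 'OCEAN': 4320, 'LU': 470, 'RADIX': 384, 'SAMPLE': 479, 'N-BODY': 1371}
sas = {'FFT': 210, 'OCEAN': 2878, 'LU': 309, 'RADIX': 201, 'SAMPLE': 450, 'N-BODY': 950}

# Ratio of MPI lines to SAS lines, rounded to two decimals.
ratios = {app: round(mpi[app] / sas[app], 2) for app in mpi}
for app, ratio in ratios.items():
    print(app, ratio)
```
&lt;br /&gt;
Radix has the largest ratio, about 1.9, while FFT and SAMPLE are close to 1, meaning the two versions are nearly the same length for those applications.&lt;br /&gt;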
&lt;br /&gt;
In terms of performance, the results indicated that SAS was only about half as efficient as MP at exploiting parallelism for most of the applications in this study. The only application that showed similar performance for both models (MP and SAS) was LU. The authors concluded that the overhead of the SVM protocol for maintaining page coherence and synchronization was the handicap of the easier SAS programming model.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2004, [http://dl.acm.org/citation.cfm?id=1006252 Iosevich and Schuster] performed a study on the choice of memory consistency model in a DSM. The two types of models under study were the '''sequential consistency''' (SC) model and the relaxed consistency model, in particular the '''home-based lazy release consistency''' (HLRC) protocol. The SC provides a less complex programming model, whereas the HLRC improves on running performance as it allows parallel memory access. Memory consistency models provide a specific set of rules for interfacing with memory.&lt;br /&gt;
&lt;br /&gt;
The authors used a [http://www.cs.uga.edu/~dkl/6730/Fall02/Readings/millipage.pdf multiview] technique to ensure an efficient implementation of SC with fine-grain access to memory. The main advantage of this technique is that, by mapping one physical region to several virtual regions, the system avoids fetching the whole content of the physical memory when there is a fault accessing a specific variable located in one of the virtual regions. One step further is the mechanism proposed by [http://onlinelibrary.wiley.com/doi/10.1002/spe.417/abstract Niv and Schuster] to dynamically change the granularity during runtime.&lt;br /&gt;
For SC with multiview (MV), only the page replicas need to be tracked, whereas for HLRC the tracking needed is much more complex. The advantage of HLRC is that write faults on read-only pages are handled locally, so these operations have a lower cost.&lt;br /&gt;
&lt;br /&gt;
This table summarizes the characteristics of the benchmark applications used for the measurements. In the Synch column, B represents barriers and  L locks.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Application&lt;br /&gt;
! Input data set&lt;br /&gt;
! Shared memory&lt;br /&gt;
!Sharing granularity&lt;br /&gt;
! Synch&lt;br /&gt;
!Allocation pattern&lt;br /&gt;
|-&lt;br /&gt;
| Water-nsq|| 8000 molecules|| 5.35MB|| a molecule (672B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Water-sp|| 8000 molecules|| 10.15MB|| a molecule (680B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| LU|| 3072 × 3072|| 72.10MB|| block (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| FFT|| 2^20 numbers|| 48.25MB|| a row segment|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| TSP|| A graph of 32 cities|| 27.86MB|| a tour (276B)|| L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| SOR|| 2066 × 10240|| 80.73MB|| a row (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Barnes-sp|| 32768 bodies|| 41.21MB|| body fields (4-32B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Radix|| 10240000 keys|| 82.73MB|| an integer (4B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Volrend|| a file -head.den-|| 29.34MB|| a 4 × 4 box (4B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Ocean|| a 514 × 514 grid|| 94.75MB|| grid point (8B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| NBody|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|-&lt;br /&gt;
| NBodyW|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The authors found that the average speedup of the HLRC protocol is 5.97, and the average speedup of the SC(with MV) protocol is 4.5 if non-optimized allocation is used, and 6.2 if the granularity is changed dynamically. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2008, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.1319 Roy and Chaudhary] compared the communication requirements of three different page-based DSM systems ([http://www.cs.umd.edu/projects/cvm/ CVM], [http://www.cs.utah.edu/flux/quarks.html Quarks], and [http://ieeexplore.ieee.org/iel4/5737/15339/00709960.pdf?arnumber=709960 Strings]) that use virtual memory mechanisms to trap accesses to shared areas. Their study was also based on the SPLASH-2 suite of programs.&lt;br /&gt;
In their experiments, Quarks was running tasks on separate processes (it does not support more than one application thread per process) while CVM and Strings were run with multiple application threads.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Program &lt;br /&gt;
!CVM &lt;br /&gt;
!Quarks &lt;br /&gt;
!Strings&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| FFT||1290||2419||1894&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-c||135||-||485&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-n||385||2873||407&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| OCEAN-c||1955||15475||6676&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-n2||2253||38438||10032&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-sp||905||7568||1998&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MATMULT||290||1307||645&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SOR||247||7236||934&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The results indicate that Quarks generates a large quantity of messages in comparison with the other two systems, as the table above shows. LU-c (the contiguous version of LU) has no result for Quarks because the excessive number of collisions made it impossible to obtain results for the number of tasks (16) used in the experiment. In general, due to the lock-related traffic, the performance of Quarks is quite low compared to the other two systems for many of the application programs tested.&lt;br /&gt;
CVM improves on the lock management by allowing out-of-order access through a centralized lock manager. This makes the locking times for CVM much smaller than for the others. &lt;br /&gt;
Nevertheless, the best performer is Strings. It beats the other two systems in overall performance, but there is room for improvement here too: locking times in Strings were observed to account for a large percentage of the overall computation. &lt;br /&gt;
&lt;br /&gt;
The overall conclusion is that using multiple threads per node and optimized locking mechanisms (CVM-type) provides the best performance.&lt;br /&gt;
&lt;br /&gt;
=== Evolution ===&lt;br /&gt;
A more recent distributed shared memory system is vNUMA.   [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html Chapman and Heiser] describe vNUMA (where v is for virtual and NUMA is for [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non Uniform Memory Access]) as &amp;quot;a virtual machine that presents a cluster as a virtual shared-memory multiprocessor.&amp;quot; The virtualization in vNUMA is a layer between the real CPUs that form the distributed shared memory system and the OS that runs on top of it, usually Linux. The DSM in vNUMA is part of the hypervisor, which is the part of a virtual system that maps guest physical addresses into real physical ones. Guest virtual addresses are first mapped into guest physical addresses by the guest OS. &lt;br /&gt;
&lt;br /&gt;
The difference from other virtual machines is that vNUMA runs one OS on top of several physical machines, whereas virtual machines often run several guest OSes on top of one host OS that runs on one machine. It also uses software DSM techniques to present all the virtualized memory as a whole.&lt;br /&gt;
&lt;br /&gt;
The DSM in vNUMA is a single-writer/multiple-reader write-invalidate protocol with sequential consistency. It is based on the IVY DSM, but it introduces several improvements to increase performance. The owner of a page can determine whether the page needs to be sent by looking at the copyset (the set of nodes that hold a copy of the page), which avoids several page faults, and the manager becomes the owner of the copyset as soon as it is part of it. There are two other improvements: ''incremental deterministic merging'' and ''write-update-plus (WU+)''. ''Incremental deterministic merging'' uses sequence numbers to ensure that a location gets updated with the latest value and not with intermediate, out-of-order writes. ''Write-update-plus (WU+)'' enforces single-writer for pages where atomic operations are done. vNUMA dynamically changes from multiple-writer to single-writer when atomic operations are detected. &lt;br /&gt;
&lt;br /&gt;
In [http://communities.vmware.com/community/vmtn/cto/high-performance/blog/2011/09/19/vnuma-what-it-is-and-why-it-matters vNUMA what it is and why it matters], VMware presents vNUMA as part of vSphere, a virtualization platform aimed at building cloud computing frameworks.&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
*[http://www.sgi.com/pdfs/4250.pdf Performance and Productivity Breakthroughs with Very Large Coherent Shared Memory: The SGI® UV Architecture] SGI white paper.&lt;br /&gt;
*[http://www.scalemp.com/architecture#2 Versatile SMP (vSMP) Architecture]&lt;br /&gt;
*Adrian Moga, Michel Dubois, [http://www.sciencedirect.com/science/article/pii/S1383762108001136 &amp;quot;A comparative evaluation of hybrid distributed shared-memory systems,&amp;quot;] Journal of Systems Architecture, Volume 55, Issue 1, January 2009, Pages 43-52&lt;br /&gt;
*Jinbing Peng; Xiang Long; Limin Xiao; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=4708970&amp;amp;isnumber=4708921 &amp;quot;DVMM: A Distributed VMM for Supporting Single System Image on Clusters,&amp;quot;] Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for , vol., no., pp.183-188, 18-21 Nov. 2008&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
*Shan, H.; Singh, J.P.; Oliker, L.; Biswas, R.; , [http://escholarship.org/uc/item/76p9b40g#page-1 &amp;quot;Message passing vs. shared address space on a cluster of SMPs,&amp;quot;] Parallel and Distributed Processing Symposium., Proceedings 15th International , vol., no., pp.8 pp., Apr 2001&lt;br /&gt;
*Protic, J.; Tomasevic, M.; Milutinovic, V.; , [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 &amp;quot;Distributed shared memory: concepts and systems,&amp;quot;] Parallel &amp;amp; Distributed Technology: Systems &amp;amp; Applications, IEEE , vol.4, no.2, pp.63-71, Summer 1996&lt;br /&gt;
*Chandola, V. , [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.5206&amp;amp;rep=rep1&amp;amp;type=pdf &amp;quot;Design Issues in Implementation of Distributed Shared Memory in User Space,&amp;quot;]&lt;br /&gt;
*Nitzberg, B.; Lo, V. , [http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf &amp;quot;Distributed Shared Memory:  A Survey of Issues and Algorithms&amp;quot;]&lt;br /&gt;
*Steven Cameron Woo , Moriyoshi Ohara , Evan Torrie , Jaswinder Pal Singh , Anoop Gupta, [http://dl.acm.org/citation.cfm?id=223990 &amp;quot;The SPLASH-2 programs: characterization and methodological considerations,&amp;quot;] Proceedings of the 22nd annual international symposium on Computer architecture, p.24-36, June 22-24, 1995, S. Margherita Ligure, Italy&lt;br /&gt;
*Jegou, Y. , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Implementation of page management in Mome, a user-level DSM] Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on, p.479-486, 21 May 2003, IRISA/INRIA, France&lt;br /&gt;
*Hennessy, J.; Heinrich, M.; Gupta, A.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=747863 &amp;quot;Cache-Coherent Distributed Shared Memory:  Perspectives on Its Development and Future Challenges,&amp;quot;] Proceedings of the IEEE, Volume: 87 Issue:3, pp.418 - 429, Mar 1999, Comput. Syst. Lab., Stanford Univ., CA&lt;br /&gt;
*J. Protic, M. Tomasevic, and V. Milutinovic, [http://media.wiley.com/product_data/excerpt/76/08186773/0818677376-2.pdf An Overview of Distributed Shared Memory]&lt;br /&gt;
*Yoon, M.; Malek, M.; , [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf &amp;quot;Configurable Shared Virtual Memory for Parallel Computing&amp;quot;] University of Texas Technical Report tr94-21, July 15 1994, Department of Electrical and Computer Engineering, The University of Texas at Austin&lt;br /&gt;
*Dubnicki, C.;   Iftode, L.;   Felten, E.W.;   Kai Li;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=508084 &amp;quot;Software Support for Virtual Memory-Mapped Communication&amp;quot;] Parallel Processing Symposium, 1996., Proceedings of IPPS '96, The 10th International, pp.372 - 381, 15-19 Apr 1996, Dept. of Comput. Sci., Princeton Univ., NJ &lt;br /&gt;
*Dubnicki, C.;   Bilas, A.;   Li, K.;   Philbin, J.;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=580931 &amp;quot;Design and Implementation of Virtual Memory-Mapped Communication on Myrinet&amp;quot;] Parallel Processing Symposium, 1997. Proceedings., 11th International, pp.388 - 396, 1-5 Apr 1997, Princeton Univ., NJ&lt;br /&gt;
*Kranz, D.; Johnson, K.; Agarwal, A.; Kubiatowicz, J.; Lim, B.; , [http://delivery.acm.org.prox.lib.ncsu.edu/10.1145/160000/155338/p54-kranz.pdf?ip=152.1.24.251&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=81831968&amp;amp;CFTOKEN=62928147&amp;amp;__acm__=1327853249_029db7e958cb50bd47056f939f3296f7 &amp;quot;Integrating Message-Passing and Shared-Memory:  Early Experience&amp;quot;] PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp.54 - 63&lt;br /&gt;
*Amza, C.;  Cox, A.L.;  Dwarkadas, S.;  Keleher, P.;  Honghui Lu;  Rajamony, R.;  Weimin Yu;  Zwaenepoel, W.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 &amp;quot;TreadMarks: shared memory computing on networks of workstations&amp;quot;] Computer, Volume: 29 Issue:2, pp. 18 - 28, Feb 1996, Dept. of Comput. Sci., Rice Univ., Houston, TX&lt;br /&gt;
*David R. Cheriton. 1985. [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems.] SIGOPS Oper. Syst. Rev. 19, 4 (October 1985), 26-33.&lt;br /&gt;
*Matthew Chapman and Gernot Heiser. 2009. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html vNUMA: a virtual shared-memory multiprocessor.] In Proceedings of the 2009 conference on USENIX Annual technical conference (USENIX'09). USENIX Association, Berkeley, CA, USA, 2-2.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
The memory hierarchy described for the CSVS system places remote memories:&lt;br /&gt;
# Between main memory and local disk storage&lt;br /&gt;
# Same hierarchy as local disk storage&lt;br /&gt;
# Below local disk storage&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When messages are sent by the OS to retrieve non-local data the virtual address of the retrieved data is translated to physical:&lt;br /&gt;
# At the origin of the message, i.e. where the page fault occurs&lt;br /&gt;
# By the DSM system default manager&lt;br /&gt;
# At the location where the desired page resides&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
DSM nodes&lt;br /&gt;
# partially map variable amounts of their memory to the distributed address space&lt;br /&gt;
# are configured to supply a contiguous and fixed amount of memory to the distributed address space&lt;br /&gt;
# utilize I/O to access the entirely non-local distributed address space&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SAS programming model:&lt;br /&gt;
# Has evolved beyond MP as it is difficult to program in scalable DSM environments&lt;br /&gt;
# Utilizes MP to communicate but relies on the ease of a common address space&lt;br /&gt;
# Has suffered too many security problems, scalable MP now dominates the landscape&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Page management in MOME:&lt;br /&gt;
# Requires consistent address space mapping across all nodes&lt;br /&gt;
# Is managed from a global DSM perspective&lt;br /&gt;
# Allows an F and V page descriptor to occur for the same page on the same node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The most adopted DSM algorithm is:&lt;br /&gt;
# Single Reader/ Single Writer (SRSW)&lt;br /&gt;
# Multiple Readers/ Single Writer (MRSW)&lt;br /&gt;
# Multiple Readers/Multiple Writers (MRMW)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Sequential Consistency:&lt;br /&gt;
# all processors see the same ordering of memory references, and these are issued in sequences by the individual processors&lt;br /&gt;
# the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence&lt;br /&gt;
# consistency is required only on synchronization accesses&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
SPLASH is a&lt;br /&gt;
# coherence protocol&lt;br /&gt;
# collection of parallel programs engineered for the evaluation of shared address space machines&lt;br /&gt;
# DSM implementation&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The complexity of development is &lt;br /&gt;
# the same for MP and SAS&lt;br /&gt;
# lower for SAS&lt;br /&gt;
# lower for MP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
vNUMA is a&lt;br /&gt;
# Fast Fourier Transform implementation&lt;br /&gt;
# network implementation&lt;br /&gt;
# virtual machine that presents a cluster as a virtual shared-memory multiprocessor&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72702</id>
		<title>CSC/ECE 506 Spring 2013/4a ss</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72702"/>
		<updated>2013-02-13T22:27:36Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Dsm.jpg|300px|thumb|left|SCD's IBM SP system blackforest, a distributed shared memory ('''DSM''') system]]&lt;br /&gt;
&lt;br /&gt;
== SAS programming on distributed-memory machines ==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Shared_memory '''Shared Address Space'''] (SAS) programming on distributed memory machines is a programming abstraction that requires less development effort than the traditional method of [http://en.wikipedia.org/wiki/Message_passing '''Message Passing'''] (MP) on distributed memory machines. A cluster of servers is an example of a platform for SAS programming.  Distributed systems are groups of computers that communicate through a network and share a common work goal.  Distributed systems typically do not physically share the same memory (are not [http://en.wikipedia.org/wiki/Coupling_%28computer_programming%29 '''tightly coupled''']); instead, each processor or group of processors must depend on mechanisms other than direct memory access in order to communicate.  Some of the issues involved include [http://en.wikipedia.org/wiki/Memory_coherence '''memory coherence'''], types of memory access, data and process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''], and performance.&lt;br /&gt;
&lt;br /&gt;
=== Origins ===&lt;br /&gt;
Distributed memory systems started to flourish in the 1980s. Increasing processor performance and network connectivity offered the perfect environment for parallel processing over a network of computers. This was a cheap way to put together massive computing power. The main drawback was the move from sequential programs written for local memory to parallel programs that share memory across machines. This was where SAS provided the means to simplify programming by hiding the mechanisms used to access distant memory located in other computers of the cluster.&lt;br /&gt;
&lt;br /&gt;
In 1985, Cheriton, in his article [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 &amp;quot;Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems&amp;quot;], introduced ideas for the application of shared memory techniques in distributed memory systems. Cheriton envisioned a system of nodes with a pool of shared memory with a common file namespace that could &amp;quot;decentralize the implementation of a service.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:Dsmd.jpg|400px|thumb|right|Distributed Shared Memory]]&lt;br /&gt;
Early distributed computer systems relied almost exclusively on message passing in order to communicate with one another, and this technique is still widely used today.  In a message passing model, each processor's local memory can be considered as isolated from that of the rest of the system.  Processes or objects can send or receive messages in order to communicate, and this can occur in a synchronous or asynchronous manner.  In distributed systems, particularly with certain types of programs, the message passing model can become overly burdensome to the programmer, as tracking data movement and maintaining data integrity can become quite challenging with many control threads.  A shared address or shared-memory system, however, can provide a programming model that simplifies data sharing via uniform mechanisms of data structure reads and writes on common memory.  Current distributed systems seek to take advantage of both SAS and MP programming model principles in hybrid systems. &lt;br /&gt;
&lt;br /&gt;
=== Distributed Shared Memory (DSM) ===&lt;br /&gt;
Distributed shared memory is an architectural approach designed to overcome the scaling limitations of symmetric shared memory multiprocessors while retaining a shared memory model for communication and programming. A distributed system consists of a set of nodes connected by an interconnection network.  Nodes may consist of individual processors or a multiprocessor system (e.g., a [http://en.wikipedia.org/wiki/Symmetric_multiprocessing '''Symmetric Multiprocessor'''] (SMP)), the latter typically sharing a system bus.  Each node contains a local memory, which maps partially to the distributed address space.  &lt;br /&gt;
Regardless of the system topology (bus, ring, mesh), a specific interconnection in each cluster must connect it to the system. Information about states and current locations of particular data blocks usually resides in a system table or directory. Directory organization varies from full map storage to different dynamic organizations.&lt;br /&gt;
There are three issues to resolve when accessing data in the DSM address space while keeping the data consistent:&lt;br /&gt;
# Which DSM algorithm to use to access data. Commonly used strategies are replication and migration. Replication allows multiple copies of the same data item to reside in different local memories. Migration implies that only a single copy of a data item exists at any one time, so the data item must be moved to the requesting site for exclusive use.&lt;br /&gt;
# Implementation level of the DSM. A lookup for data should first determine whether the requested data is in local memory; if not, the system must bring the data to local memory. This can be executed in software, in hardware, or in both. The choice of implementation depends on the price/performance trade-offs.&lt;br /&gt;
# Memory consistency model. The behavior of the memory with respect to read and write operations from multiple processors has to be handled with an appropriate memory consistency model.&lt;br /&gt;
&lt;br /&gt;
=== Cache-Coherent DSM ===&lt;br /&gt;
&lt;br /&gt;
The cache coherence problem arises when different processors cache and update values of the same memory location. Early DSM systems implemented a shared address space where the amount of time required to access a piece of data was &lt;br /&gt;
related to its location.  These systems became known as [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access '''Non-Uniform Memory Access'''] (NUMA) architectures, whereas an SMP type&lt;br /&gt;
system is known as a [http://en.wikipedia.org/wiki/Uniform_Memory_Access '''Uniform Memory Access'''] (UMA) architecture.  NUMA architectures were difficult to program due &lt;br /&gt;
to potentially significant differences in data access times. SMP architectures dealt with this problem through caching. &lt;br /&gt;
Protocols were established to ensure that, before a location is written, all other copies of the location (in other caches)&lt;br /&gt;
are invalidated. Thus, the system allows multiple copies of a memory location to exist when&lt;br /&gt;
it is being read, but only one copy when it is being written. &lt;br /&gt;
&lt;br /&gt;
Cache-coherent DSM architectures rely on directory-based [http://en.wikipedia.org/wiki/Cache_coherency '''cache coherence'''], where an extra directory structure keeps track&lt;br /&gt;
of all blocks that have been cached by each processor. Since the directory tracks which caches have copies&lt;br /&gt;
of any given memory block, a coherence protocol can use&lt;br /&gt;
the directory, together with state and other information about each cached block, to maintain a consistent view of &lt;br /&gt;
memory. A simple cache coherence&lt;br /&gt;
protocol can operate with three states for each cache block: Invalid,&lt;br /&gt;
Shared, and Exclusive.  Furthermore, in a cache-coherent DSM machine, the directory is distributed across the nodes, so that each&lt;br /&gt;
directory entry resides with the memory block it describes in that node's local physical memory.&lt;br /&gt;
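&lt;br /&gt;
As a rough illustration of the three states named above, the following Python sketch models a single directory entry. The class name ''DirectoryEntry'' and its ''read'' and ''write'' methods are hypothetical names chosen for this example only; they do not come from any particular DSM system, and a real protocol would also handle write-backs, message races and replacement.&lt;br /&gt;
```python
class DirectoryEntry:
    """Directory state for one memory block (illustrative sketch only)."""

    def __init__(self):
        self.state = "Invalid"   # one of: Invalid, Shared, Exclusive
        self.sharers = set()     # ids of the caches that hold a copy

    def read(self, cache_id):
        # A read by the exclusive owner is a cache hit and changes nothing.
        if self.state == "Exclusive" and cache_id in self.sharers:
            return
        # Otherwise the reader gets a copy and the block becomes shared,
        # so several read-only copies may exist at the same time.
        self.sharers.add(cache_id)
        self.state = "Shared"

    def write(self, cache_id):
        # Before the write, every other cached copy is invalidated.
        invalidated = self.sharers - {cache_id}
        self.sharers = {cache_id}
        self.state = "Exclusive"
        return invalidated


entry = DirectoryEntry()
entry.read(0)
entry.read(1)
print(entry.state, sorted(entry.sharers))
print(sorted(entry.write(1)), entry.state)
```
The ''write'' call returns the set of caches that must receive invalidations, which is the information the directory exists to provide.&lt;br /&gt;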
&lt;br /&gt;
=== Page Management and memory mapping in Mome ===&lt;br /&gt;
[[File:Untitled_Project.jpg|350px|thumb|left|Memory Mapping in Mome]]&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Mome] is described by its developers as a user-level distributed shared memory.  Mome, in 2003, was a run-time model that mapped Mome segments onto node private address space.  &lt;br /&gt;
&lt;br /&gt;
===== Mome Segment creation =====&lt;br /&gt;
&lt;br /&gt;
Segment creation was initiated through a ''MomeCreateSegment(size)'' call which returned an identifier for mapping used by all nodes.  Any process can request a mapping of a section of its local memory to a Mome segment section by calling ''MomeMap(Addr, Lg, Prot, Flags, Seg, Offset)'', which returns the starting address of the mapped region.  Each mapping request made by a process is independent and the addresses of the mappings may or may not be consistent on all nodes.  If mappings are consistent between processes, however, then pointers may be shared by them.  Mome supports strong and weak consistency models, and for any particular page each node is able to dynamically manage its consistency during program execution.&lt;br /&gt;
&lt;br /&gt;
===== Page Management in Mome =====&lt;br /&gt;
&lt;br /&gt;
Mome manages [http://en.wikipedia.org/wiki/Page_%28computer_memory%29 '''pages'''] in a directory-based scheme where each page directory maintains the status of six characteristics per page on each node.  The page manager acts upon collections of nodes according to these characteristics for each page:  &lt;br /&gt;
V nodes possess the current version, M nodes have a modified version, S nodes want strong consistency, I nodes are &lt;br /&gt;
invalidated, F nodes have initiated a modification merge and H nodes are a special type of hidden page.  A new version of &lt;br /&gt;
a page is created prior to a constraint violation and before modifications are integrated as a result of a consistency&lt;br /&gt;
request.  &lt;br /&gt;
&lt;br /&gt;
===== Memory mapping in Mome =====&lt;br /&gt;
&lt;br /&gt;
The Mome memory mapping figure to the left shows a possible DSM memory organization on a single node.  The DSM memory size&lt;br /&gt;
shown is 22 pages.  When a new segment is created on a node a segment descriptor is created on that node.  In this case the&lt;br /&gt;
segment descriptor is 12 pages, with each segment descriptor block corresponding to one page.  Each block also contains&lt;br /&gt;
three DSM memory references for current, modified and next version of pages.  The memory organization state shows an &lt;br /&gt;
application with two mappings, M1 and M2, with segment offsets at 0 and 8.  The six pages of M1 are managed by segment &lt;br /&gt;
descriptor blocks 0 to 5.  The descriptor blocks (and application memory) show that pages 1, 2 and 5 have no associated &lt;br /&gt;
memory, while M1 page 0 is mapped to block 6 as a current version and M1 page 3 is mapped to block 13 as a current version,&lt;br /&gt;
block 8 as a modified version, and has initiated a modifications merge as indicated by the block 17 pointer.  The communication&lt;br /&gt;
layer manages incoming messages from other nodes. &lt;br /&gt;
&lt;br /&gt;
[[File:Mem hierarchy.png|200px|thumb|right|Memory hierarchy of node]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Node Communication ===&lt;br /&gt;
As described by [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf Yoon, et al.] in 1994, the communication paradigm &lt;br /&gt;
for their DSM node network relies on a memory hierarchy for each node that places remote memories at the same level as &lt;br /&gt;
its own local disk storage.  [http://en.wikipedia.org/wiki/Page_fault '''Page faults'''] within a given node that can be resolved within disk storage are handled &lt;br /&gt;
normally, while those that cannot are resolved between node main memory and memory of other nodes. Point-to-point &lt;br /&gt;
communication at the node level is supported through message passing, and the specific mechanism for communication is&lt;br /&gt;
agreed to by all nodes. &lt;br /&gt;
&lt;br /&gt;
Yoon describes a DSM system that generates a shared virtual memory on a per-job basis.  A '''configurable shared virtual address space'''&lt;br /&gt;
(CSVS) is prepared when a member node receives a job, generates a job identification number and creates&lt;br /&gt;
an information table in its memory:&lt;br /&gt;
&lt;br /&gt;
                                     ''JOB_INFORMATION {''&lt;br /&gt;
                                         ''status;''&lt;br /&gt;
                                         ''number_of_tasks;''&lt;br /&gt;
                                         ''number_of_completed_tasks;''&lt;br /&gt;
                                         ''*member_list;''                    /*pointer to first member*/&lt;br /&gt;
                                         ''number_of_members;''&lt;br /&gt;
                                         ''IO_server;''&lt;br /&gt;
                                     ''}''&lt;br /&gt;
&lt;br /&gt;
The ''status'' refers to the creation of the CSVS, and ''number_of_members'' and ''member_list'' are established through&lt;br /&gt;
a task distribution process during address space assignment.  All tasks associated with the program are tagged with&lt;br /&gt;
the ''job_id'' and ''requester_id'' and, following address space assignment, are distributed across the system.  The&lt;br /&gt;
actual CSVS creation occurs when the first task of a job is initiated by a member, which requests that all other members generate the new&lt;br /&gt;
CSVS.  Subspace assignment for the SAS model ensues under the specific ''job_id''.&lt;br /&gt;
&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Operating_system '''operating system'''] (OS) or [http://en.wikipedia.org/wiki/Memory_management_unit '''memory management unit'''] (MMU) of each member maintains a copy of the ''JOB_INFORMATION'' &lt;br /&gt;
table which is consulted to identify the default manager when a page fault occurs.  When a page fault does occur, the MMU&lt;br /&gt;
locates the default manager and handles the fault normally.  If the page requested is out of its subspace then the &lt;br /&gt;
virtual address, ''job_id'', and default manager identification are sent to the [http://en.wikipedia.org/wiki/Control_unit '''control unit'''] (CU) to construct a &lt;br /&gt;
message requesting a page copy.  All messages sent through the CSVS must include a virtual address and the ''job_id'',&lt;br /&gt;
which acts as protection to control access to relevant memory locations.  When received at the appropriate member&lt;br /&gt;
node, the virtual address is translated to a local physical address.&lt;br /&gt;
[[File:Jacobi_code.jpg|300px|thumb|left|Jacobi method pseudocode using TreadMarks API]]&lt;br /&gt;
&lt;br /&gt;
=====Improvements in communication=====&lt;br /&gt;
Early SAS programming models in DSM environments suffered from poor performance because protection schemes required&lt;br /&gt;
applications to access the network via system calls, significantly increasing latency.  Later software&lt;br /&gt;
systems and network interfaces arose that were able to ensure safety without incurring the time cost of the system calls.  Addressing this and other &lt;br /&gt;
latency sources on both ends of communication was an important goal for projects such as the&lt;br /&gt;
'''Virtual memory-mapped communication''' (VMMC) model that was developed as part of the [http://shrimp.cs.princeton.edu/index.html Shrimp Project]. &lt;br /&gt;
&lt;br /&gt;
Protection is achieved in VMMC because the receiver must grant permission before the sender is allowed to transfer data&lt;br /&gt;
to a receiver-defined area of its address space.  In this communication scheme, the receiver process exports areas of its&lt;br /&gt;
address space that will act as receive buffers, and sending processes must import the destinations.  There is no explicit &lt;br /&gt;
receive operation in VMMC.  Receivers are able&lt;br /&gt;
to define which senders can import specific buffers, and VMMC ensures only receiver buffer space is overwritten.  Imported&lt;br /&gt;
receive buffers are mapped to a destination proxy space, which can be implemented as part of the sender's virtual address&lt;br /&gt;
space and can be translated by VMMC to a receiver node, process and memory address.  VMMC supports a deliberate update&lt;br /&gt;
request and will update data sent previously to an imported receive buffer.  This transfer occurs directly without receiver&lt;br /&gt;
CPU interruption.&lt;br /&gt;
&lt;br /&gt;
[[File:Shortest_path_pseudocode.jpg|300px|thumb|right|Shortest path pseudocode using TreadMarks API]]&lt;br /&gt;
=== Programming Environment ===&lt;br /&gt;
The globally shared memory abstraction provided through virtual memory or some other DSM mechanism allows programmers &lt;br /&gt;
to focus on algorithms instead of processor communication and data tracking.  Many programming environments have been&lt;br /&gt;
developed for DSM systems including Rice University's [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]&lt;br /&gt;
in the 1990s.  TreadMarks was a user-level library that ran on top of Unix.  Programs were written in&lt;br /&gt;
C, C++ or Fortran and then compiled and linked with the TreadMarks library.  &lt;br /&gt;
&lt;br /&gt;
Shown at left is a pseudocode example of using the TreadMarks API to implement the Jacobi method, a type of partial &lt;br /&gt;
differential equation solver.  The code iterates over a 2D array and updates each element to the average of its four&lt;br /&gt;
nearest neighbors.  All processors are assigned an approximately equivalent number of rows and neighboring processes &lt;br /&gt;
share boundary rows as is necessary for the calculation.  This example shows TreadMarks use of [http://en.wikipedia.org/wiki/Barrier_%28computer_science%29 '''barriers'''], a technique used for process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''].  Barriers prevent race&lt;br /&gt;
conditions.  ''void Tmk_startup(int argc, char **argv'') initializes TreadMarks and starts the remote processes.  &lt;br /&gt;
The ''void Tmk_barrier(unsigned id)'' call blocks the calling process until every other process arrives at the barrier.  In this&lt;br /&gt;
example, ''Tmk_barrier(0)'' guarantees that process 0 completes initialization before any process proceeds, ''Tmk_barrier(1)'' &lt;br /&gt;
guarantees all previous iteration values are read before any current iteration values are written, and ''Tmk_barrier(2)''&lt;br /&gt;
guarantees all current iteration values are written before any next iteration computation begins.&lt;br /&gt;
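&lt;br /&gt;
The barrier pattern described above can be sketched in a few lines.  This is a minimal illustration and not TreadMarks code: it assumes that Python threads in a single process stand in for the DSM processes and that ''threading.Barrier'' stands in for ''Tmk_barrier''.  The grid size, the iteration count and the helper names (''new_grid'', ''average'', ''jacobi_parallel'', ''jacobi_serial'') are invented for the example.&lt;br /&gt;

```python
import threading

# Assumed sizes for the sketch: 8 interior rows split evenly over 4 workers.
ROWS, COLS, NPROCS, ITERS = 10, 8, 4, 20


def new_grid():
    # Top boundary row fixed at 1.0; everything else starts at 0.0.
    return [[1.0 if r == 0 else 0.0] * COLS for r in range(ROWS)]


def average(grid, r, c):
    # Average of the four nearest neighbours of element (r, c).
    return (grid[r - 1][c] + grid[r + 1][c] + grid[r][c - 1] + grid[r][c + 1]) / 4.0


def jacobi_parallel():
    grid, scratch = new_grid(), new_grid()
    barrier = threading.Barrier(NPROCS)  # stand-in for Tmk_barrier
    per = (ROWS - 2) // NPROCS           # interior rows owned by each worker

    def worker(pid):
        band = range(1 + pid * per, 1 + (pid + 1) * per)
        for _ in range(ITERS):
            for r in band:
                for c in range(1, COLS - 1):
                    scratch[r][c] = average(grid, r, c)
            barrier.wait()  # every worker has finished reading the old values
            for r in band:
                grid[r][1:COLS - 1] = scratch[r][1:COLS - 1]
            barrier.wait()  # every worker has finished writing the new values

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(NPROCS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return grid


def jacobi_serial():
    # Single-threaded reference used to check the parallel version.
    grid = new_grid()
    for _ in range(ITERS):
        nxt = [row[:] for row in grid]
        for r in range(1, ROWS - 1):
            for c in range(1, COLS - 1):
                nxt[r][c] = average(grid, r, c)
        grid = nxt
    return grid
```

Because the two barriers separate the read phase from the write phase of each sweep, ''jacobi_parallel()'' should return the same grid as the single-threaded ''jacobi_serial()''.  Removing either barrier would let a worker read a neighbouring boundary row while it is being overwritten.&lt;br /&gt;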
&lt;br /&gt;
To the right is a short pseudocode program that illustrates another SAS synchronization technique, [http://en.wikipedia.org/wiki/Lock_%28computer_science%29 '''locks'''].  This program calculates the shortest route through a group of nodes that starts at a designated start node, visits each&lt;br /&gt;
other node once and returns to the origin node.  The shortest route identified thus far is stored in the shared ''Shortest_length'',&lt;br /&gt;
and the routes under investigation are kept in a shared queue, most promising at the front, and expanded one node at a time.  A process&lt;br /&gt;
compares the length of its resulting shortest path with ''Shortest_length'', updates it if necessary, and returns to the queue&lt;br /&gt;
to continue its search.  Process 0 allocates the shared queue and minimum length.  Exclusive access must be established&lt;br /&gt;
and maintained to ensure correctness, and this is achieved through locks on the queue and on ''Shortest_length''.  Each&lt;br /&gt;
process acquires the queue lock to identify a promising partial path and releases it upon finding one.  When &lt;br /&gt;
updating ''Shortest_length'', a lock is likewise acquired to ensure [http://en.wikipedia.org/wiki/Mutual_exclusion '''mutual exclusion'''] on this shared data.&lt;br /&gt;
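&lt;br /&gt;
The two-lock structure described above can be sketched as follows.  This is a simplified illustration under stated assumptions, not the TreadMarks program: Python threads stand in for DSM processes, ''threading.Lock'' stands in for the TreadMarks lock calls, the queue holds complete tours instead of prioritized partial paths, and the distance matrix ''DIST'' and the function names are invented for the example.&lt;br /&gt;

```python
import itertools
import threading

# Hypothetical distance matrix for four nodes; DIST[a][b] is the cost a to b.
DIST = [
    [0, 2, 9, 10],
    [1, 0, 6, 4],
    [15, 7, 0, 8],
    [6, 3, 12, 0],
]


def tour_length(order):
    # Length of the round trip 0, order..., 0.
    path = (0,) + tuple(order) + (0,)
    return sum(DIST[a][b] for a, b in zip(path, path[1:]))


def shortest_tour(nworkers=3):
    # Shared state: the work queue and the best length found so far.
    pending = list(itertools.permutations(range(1, len(DIST))))
    shared = {'shortest_length': float('inf')}
    queue_lock = threading.Lock()
    length_lock = threading.Lock()

    def worker():
        while True:
            with queue_lock:          # exclusive access to the shared queue
                if not pending:
                    return
                order = pending.pop()
            length = tour_length(order)   # computed outside any lock
            with length_lock:         # exclusive access to the shared minimum
                shared['shortest_length'] = min(shared['shortest_length'], length)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared['shortest_length']
```

Each lock is held only while the shared structure it protects is touched; the tour length itself is computed outside both locks so that workers do not serialize on the expensive part of the work.&lt;br /&gt;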
&lt;br /&gt;
=== DSM Implementations ===&lt;br /&gt;
From an architectural point of view, DSMs are composed of several nodes connected via a network. Each node can be an individual machine or a cluster of machines. Each node has local memory modules that belong, partly or entirely, to the shared memory. There are many characteristics that can be used to classify DSM implementations. The most obvious one is the nature of the implementation: software, hardware, or hybrid. This historical classification has been extracted from [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 Distributed shared memory: concepts and systems].&lt;br /&gt;
&lt;br /&gt;
Software DSM implementations are those in which the DSM is built from user-level software, the OS, a programming language, or a combination of some or all of them. &lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Implementation'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Implementation / Cluster configuration'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Network'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Algorithm'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Consistency Model'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Granularity Unit'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Coherence Policy'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''SW/HW/Hybrid'''&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/vm-and-gc/ivy-shared-virtual-memory-li-icpp-1988.pdf IVY]||User-level library + OS modification || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||1 Kbyte ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/121133.121159 Munin]||Runtime system + linker + library + preprocessor + OS modifications ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||Type-specific (SRSW, MRSW, MRMW) ||Release ||Variable size objects ||Type-specific (delayed update, invalidate) ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]||User-level || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRMW ||Lazy release ||4 Kbytes ||Update, Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/74851.74871 Mirage]||OS kernel ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||MRSW ||Sequential ||512 bytes ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://onlinelibrary.wiley.com.prox.lib.ncsu.edu/doi/10.1002/spe.4380210503/pdf Clouds]||OS, out of kernel || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Inconsistent, sequential ||8 Kbytes ||Discard segment when unlocked ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1663305&amp;amp;tag=1 Linda]||Language || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||Variable (tuple size) ||Implementation- dependent ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=93183 Memnet]||Single processor, Memnet device||Token ring||MRSW||Sequential||32 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Scalable_Coherent_Interface SCI]||Arbitrary||Arbitrary||MRSW||Sequential||16 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.8026&amp;amp;rep=rep1&amp;amp;type=pdf KSR1]||64-bit custom PE, I+D caches, 32M local memory||Ring-based||MRSW||Sequential||128 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=766965&amp;amp;isnumber=16621 RMS]||1-4 processors, caches, 256M local memory||RM bus||MRMW||Processor||4 bytes||Update||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Alewife_(multiprocessor) Alewife]||Sparcle PE, 64K cache, 4M local memory, CMMU|| mesh||MRSW||Sequential||16 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://dl.acm.org/citation.cfm?id=192056 Flash]||MIPS T5, I+D caches, Magic controller|| mesh||MRSW||Release||128 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s08/papers/10_tempest.pdf Typhoon]||SuperSparc, 2-L caches|| NP controller||MRSW||Custom||32 bytes||Invalidate custom||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://shrimp.cs.princeton.edu/ Shrimp]||16 Pentium PC nodes|| Intel Paragon routing network||MRMW||AURC, scope||4 Kbytes||Update/Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Below is an explanation of the main characteristics listed in the DSM classification.&lt;br /&gt;
 &lt;br /&gt;
There are three types of DSM algorithm: &lt;br /&gt;
* '''Single Reader/ Single Writer''' (SRSW) &lt;br /&gt;
** central server algorithm - produces long network delays &lt;br /&gt;
** migration algorithm - produces thrashing and false sharing&lt;br /&gt;
* '''Multiple Readers/ Single Writer''' (MRSW) - read replication algorithm. It uses write invalidate. MRSW is the most adopted algorithm.&lt;br /&gt;
* '''Multiple Readers/Multiple Writers''' (MRMW) - full replication algorithm.  It has full concurrency and uses atomic updates.&lt;br /&gt;
&lt;br /&gt;
The consistency model plays a fundamental role in DSM systems. Because memory is distributed, each consistency model constrains how memory accesses may be ordered. &amp;quot;A memory consistency model defines the legal ordering of memory references issued by some processor, as observed by other processors.&amp;quot; The stricter the consistency model, the higher the access times, but the simpler the programming. Some of the consistency model types are:&lt;br /&gt;
* Sequential consistency - all processors see the same ordering of memory references, and the references of each individual processor appear in the order in which that processor issued them.&lt;br /&gt;
* Processor consistency - the order of writes issued by each individual processor is observed by every processor in the system, but the order of reads does not need to be in sequence. &lt;br /&gt;
* Weak consistency - consistency is required only on synchronization accesses.&lt;br /&gt;
* Release consistency - divides the synchronization accesses into acquire and release. Normal reads and writes by a particular node can be performed only after all previous acquires by that node have finished. Similarly, a release can be performed only after all previous reads and writes have finished. &lt;br /&gt;
* Lazy Release consistency - extends the Release consistency, by propagating modifications to the shared data only on the acquire, and of those, only the ones related to critical sections.&lt;br /&gt;
* Entry consistency - synchronization is performed at variable level. This increases programming labor but helps with lowering latency and traffic exchanges as only the specific variable needs to be synchronized.&lt;br /&gt;
&lt;br /&gt;
Granularity refers to the size of the data blocks that are managed by the coherence protocols. The unit differs between hardware and software systems, as hardware systems tend to use smaller blocks than the virtual memory layer that manages the data in software systems. The problem with larger blocks is that the probability of contention is higher, even when the processors involved are not accessing the exact same piece of memory, just different parts of the same block. This is known as false sharing and it creates thrashing (memory blocks keep being requested by processors, and processors keep waiting for the same memory blocks).&lt;br /&gt;
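&lt;br /&gt;
The effect of block size on false sharing can be shown with simple address arithmetic.  The sketch below is illustrative only; the function names and the two example addresses are assumptions chosen for the example, while the 4 Kbyte and 16 byte block sizes match typical software and hardware granularities from the table above.&lt;br /&gt;

```python
def block_of(address, block_size):
    # Index of the coherence block that contains this byte address.
    return address // block_size


def falsely_shared(addr_a, addr_b, block_size):
    # Two different addresses that land in the same coherence block.
    same_block = block_of(addr_a, block_size) == block_of(addr_b, block_size)
    return addr_a != addr_b and same_block


# Hypothetical variables used by two different processors.
ADDR_A, ADDR_B = 0x1000, 0x1800

for size in (4096, 16):
    print(size, falsely_shared(ADDR_A, ADDR_B, size))
```

With 4 Kbyte pages both addresses fall in the same block, so a write by one processor disturbs the other's copy even though the data are unrelated; with 16 byte blocks they are managed independently.&lt;br /&gt;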
&lt;br /&gt;
Coherence policy regulates data replication. It dictates whether data being written at one site should be invalidated or updated at the remote sites. Usually, systems with fine-grain coherence (byte/word) impose the update policy, whereas systems based on coarse-grain (page) coherence use the invalidate policy. Elsewhere in the literature this is called the coherence protocol, and the two types are known as write-invalidate and write-update. The write-invalidate protocol invalidates all copies except the writer's before the write is performed. In contrast, write-update keeps all copies up to date.&lt;br /&gt;
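&lt;br /&gt;
The difference between the two policies can be sketched with a toy model of one replicated block.  This is an assumption-laden illustration and does not model any specific system in the table: the ''Block'' class, its fields and the message counting are invented for the example, and network transfers are reduced to a counter.&lt;br /&gt;

```python
class Block:
    """One shared block replicated on several nodes (toy model)."""

    def __init__(self, nodes, policy):
        self.policy = policy                  # 'invalidate' or 'update'
        self.copies = {n: 0 for n in nodes}   # node name to cached value
        self.messages = 0                     # coherence messages sent

    def read(self, node):
        if node not in self.copies:
            # Copy was invalidated earlier: fetch it again from a valid holder.
            self.copies[node] = next(iter(self.copies.values()))
            self.messages += 1
        return self.copies[node]

    def write(self, node, value):
        others = [n for n in self.copies if n != node]
        self.messages += len(others)          # one message per remote copy
        for n in others:
            if self.policy == 'invalidate':
                del self.copies[n]            # remote copy becomes invalid
            else:
                self.copies[n] = value        # remote copy is refreshed
        self.copies[node] = value


inv = Block(['A', 'B', 'C'], 'invalidate')
inv.write('A', 7)
upd = Block(['A', 'B', 'C'], 'update')
upd.write('A', 7)
```

After the write, the invalidate block has a single valid copy on node A and a later read on B must fetch the value again, whereas the update block still has three valid copies that all hold the new value.&lt;br /&gt;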
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
There are numerous studies of the performance of shared memory applications in distributed systems. The vast majority of them use a collection of programs named [http://dl.acm.org/citation.cfm?id=223990 SPLASH and SPLASH-2.]&lt;br /&gt;
===== SPLASH and SPLASH-2 =====&lt;br /&gt;
The '''Stanford ParalleL Applications for Shared memory''' (SPLASH) is a collection of parallel programs engineered for the evaluation of shared address space machines. These programs have been used in research studies to measure and analyze different aspects of the DSM architectures that were emerging at the time. A subsequent suite of programs (SPLASH-2) grew out of the need to overcome the limitations of the SPLASH programs. SPLASH-2 covers a broader domain of scientific programs, makes use of improved algorithms, and pays more attention to the architecture of the underlying systems.&lt;br /&gt;
&lt;br /&gt;
Selected applications in the SPLASH-2 collections include:&lt;br /&gt;
*FFT: a '''Fast Fourier Transform''' implementation, in which the data is organized in source and destination matrices so that each processor stores a contiguous set of rows in its local memory. In this application all processors communicate with one another, sending data to each other, to perform a matrix transposition.&lt;br /&gt;
*Ocean: calculations of large scale ocean movements simulating eddy currents. Its computations show nearest-neighbor access patterns on multiple grids, as opposed to a single grid.&lt;br /&gt;
*LU: decomposition of a matrix into the product of a lower triangular and an upper triangular matrix. LU exhibits a &amp;quot;one-to-many non-personalized communication&amp;quot;.&lt;br /&gt;
*Barnes: simulates the interaction of a group of particles over time steps. &lt;br /&gt;
*Radix sort: integer sorting algorithm. This algorithm implementation displays an example of communication among all the processors involved, and the nature of this communication presents irregular patterns.&lt;br /&gt;
&lt;br /&gt;
===== Case studies =====&lt;br /&gt;
In 2001, [http://escholarship.org/uc/item/76p9b40g#page-1 Shan et al.] presented a comparison of the performance and programming effort of MP versus SAS running on clusters of '''Symmetric Multiprocessors''' (SMPs). They highlighted the &amp;quot;automatic management and coherent replication&amp;quot; of the SAS programming model, which facilitates programming for these types of clusters. The study uses the MPI/Pro protocol for the MP programming model and the GeNIMA SVM protocol (a page-based shared virtual memory protocol) for SAS on a 32-processor system (a cluster of 8 machines, each a 4-way SMP). The subset of applications used includes regularly structured applications such as FFT, Ocean, and LU, contrasted with irregular ones such as RADIX sort, among others.&lt;br /&gt;
&lt;br /&gt;
The complexity of development is represented by the number of lines of code per application, as shown in the table below. SAS complexity is significantly lower than that of MP, and the difference increases as applications become more irregular and dynamic in nature (the MP version is almost twice as long for Radix).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Appl.&lt;br /&gt;
!FFT &lt;br /&gt;
!OCEAN &lt;br /&gt;
!LU &lt;br /&gt;
!RADIX &lt;br /&gt;
!SAMPLE &lt;br /&gt;
!N-BODY&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MPI ||222||4320||470||384||479||1371&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SAS ||210 ||2878 ||309||201||450||950&lt;br /&gt;
|}&lt;br /&gt;
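&lt;br /&gt;
The size difference can be made explicit by dividing the MPI line count by the SAS line count for each application.  The short calculation below uses only the figures from the table above; the variable names are chosen for the example.&lt;br /&gt;

```python
# Lines of code per application, copied from the table: (MPI, SAS).
LINES = {
    'FFT': (222, 210),
    'OCEAN': (4320, 2878),
    'LU': (470, 309),
    'RADIX': (384, 201),
    'SAMPLE': (479, 450),
    'N-BODY': (1371, 950),
}

# Ratio of MPI code size to SAS code size for each application.
ratios = {app: mpi / sas for app, (mpi, sas) in LINES.items()}

for app, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f'{app:8s} MPI/SAS = {ratio:.2f}')
```

The ratio ranges from about 1.06 for FFT to about 1.91 for RADIX, which is the basis for the statement that the MP version of Radix is almost twice as long.&lt;br /&gt;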
&lt;br /&gt;
In terms of performance, the results indicated that SAS was only about half as efficient at exploiting parallelism for most of the applications in this study. The only application that showed similar performance for both methods (MP and SAS) was LU. The authors concluded that the overhead of the SVM protocol for maintaining page coherence and synchronization was the handicap of the easier SAS programming model.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2004, [http://dl.acm.org/citation.cfm?id=1006252 Iosevich and Schuster] performed a study on the choice of memory consistency model in a DSM. The two types of models under study were the '''sequential consistency''' (SC) model and the relaxed consistency model, in particular the '''home-based lazy release consistency''' (HLRC) protocol. The SC provides a less complex programming model, whereas the HLRC improves on running performance as it allows parallel memory access. Memory consistency models provide a specific set of rules for interfacing with memory.&lt;br /&gt;
&lt;br /&gt;
The authors used a [http://www.cs.uga.edu/~dkl/6730/Fall02/Readings/millipage.pdf multiview] technique to ensure an efficient implementation of SC with fine-grain access to memory. The main advantage of this technique is that, by mapping one physical region to several virtual regions, the system avoids fetching the whole content of the physical memory when there is a fault on a specific variable located in one of the virtual regions. One step further is the mechanism proposed by [http://onlinelibrary.wiley.com/doi/10.1002/spe.417/abstract Niv and Schuster] to change the granularity dynamically at runtime.&lt;br /&gt;
For SC with multiview (MV), only the page replicas need to be tracked, whereas for HLRC the tracking required is much more complex. The advantage of HLRC is that write faults on read-only pages are handled locally, so these operations cost less.&lt;br /&gt;
&lt;br /&gt;
This table summarizes the characteristics of the benchmark applications used for the measurements. In the Synch column, B represents barriers and  L locks.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Application&lt;br /&gt;
! Input data set&lt;br /&gt;
! Shared memory&lt;br /&gt;
!Sharing granularity&lt;br /&gt;
! Synch&lt;br /&gt;
!Allocation pattern&lt;br /&gt;
|-&lt;br /&gt;
| Water-nsq|| 8000 molecules|| 5.35MB|| a molecule (672B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Water-sp|| 8000 molecules|| 10.15MB|| a molecule (680B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| LU|| 3072 × 3072|| 72.10MB|| block (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| FFT|| 2^20 numbers|| 48.25MB|| a row segment|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| TSP|| A graph of 32 cities|| 27.86MB|| a tour (276B)|| L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| SOR|| 2066 × 10240|| 80.73MB|| a row (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Barnes-sp|| 32768 bodies|| 41.21MB|| body fields (4-32B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Radix|| 10240000 keys|| 82.73MB|| an integer (4B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Volrend|| a file -head.den-|| 29.34MB|| a 4 × 4 box (4B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Ocean|| a 514 × 514 grid|| 94.75MB|| grid point (8B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| NBody|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|-&lt;br /&gt;
| NBodyW|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The authors found that the average speedup of the HLRC protocol is 5.97, and the average speedup of the SC(with MV) protocol is 4.5 if non-optimized allocation is used, and 6.2 if the granularity is changed dynamically. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2008, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.1319 Roy and Chaudhary] compared the communication requirements of three different page-based DSM systems ([http://www.cs.umd.edu/projects/cvm/ CVM], [http://www.cs.utah.edu/flux/quarks.html Quarks], and [http://ieeexplore.ieee.org/iel4/5737/15339/00709960.pdf?arnumber=709960 Strings]) that use virtual memory mechanisms to trap accesses to shared areas. Their study was also based on the SPLASH-2 suite of programs.&lt;br /&gt;
In their experiments, Quarks was running tasks on separate processes (it does not support more than one application thread per process) while CVM and Strings were run with multiple application threads.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Program &lt;br /&gt;
!CVM &lt;br /&gt;
!Quarks &lt;br /&gt;
!Strings&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| FFT||1290||2419||1894&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-c||135||-||485&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-n||385||2873||407&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| OCEAN-c||1955||15475||6676&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-n2||2253||38438||10032&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-sp||905||7568||1998&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MATMULT||290||1307||645&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SOR||247||7236||934&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The results indicate that Quarks generates a large quantity of messages in comparison with the other two systems. This can be observed in the table above. LU-c (contiguous version of LU) does not have a result for Quarks, as it was not possible to obtain results for the number of tasks (16) used in the experiment. This was due to the excessive number of collisions. In general, due to the lock related traffic, the performance of Quarks is quite low when compared to the other two systems for many of the application programs tested.&lt;br /&gt;
CVM improves on the lock management by allowing out of order access through a centralized lock manager. This makes the locking times for CVM much smaller than for others. &lt;br /&gt;
Nevertheless, the best performer is Strings. It wins in overall performance over the two other systems compared, but there is room for improvement here too, as the locking times in Strings were observed to be a high percentage of the overall computation. &lt;br /&gt;
&lt;br /&gt;
The overall conclusion is that using multiple threads per node and optimized locking mechanisms (CVM-type) provides the best performance.&lt;br /&gt;
&lt;br /&gt;
=== Evolution ===&lt;br /&gt;
A more recent version of a distributed shared memory system is vNUMA.   [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html Chapman and Heiser] describe vNUMA (where v is for virtual and NUMA is for [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non Uniform Memory Access]) as &amp;quot;a virtual machine that presents a cluster as a virtual shared-memory multiprocessor.&amp;quot; The virtualization in vNUMA is a layer between the real CPUs that form the distributed shared memory system and the OS that runs on top of it, usually Linux. The DSM in vNUMA is part of the hypervisor, which is the part of a virtual system that maps guest virtual addresses into real physical ones. The guest virtual addresses are then mapped into virtual memory addresses through the guest OS. &lt;br /&gt;
&lt;br /&gt;
The difference from other virtual machines is that vNUMA runs one OS on top of several physical machines, whereas virtual machines often run several guest OSes on top of one host OS that runs on one machine. vNUMA also uses software DSM techniques to present all the virtualized memory as a whole.&lt;br /&gt;
&lt;br /&gt;
The DSM in vNUMA is a single-writer/multiple-reader write-invalidate protocol with sequential consistency. It is based on the IVY DSM, but it introduces several improvements to increase performance. The owner of a page can determine whether the page needs to be sent by looking at the copyset (the set of nodes that hold a copy of the page), which avoids several page faults, and the manager becomes the owner of the copyset as soon as it is part of it. There are two further improvements: ''incremental deterministic merging'' and ''write-update-plus (WU+)''. ''Incremental deterministic merging'' uses sequence numbers to ensure that a location gets updated with the latest value and not with intermediate, out-of-order writes. ''Write-update-plus (WU+)'' enforces single-writer for pages where atomic operations are done. vNUMA dynamically changes from multiple-writer to single-writer when atomic operations are detected. &lt;br /&gt;
&lt;br /&gt;
In [http://communities.vmware.com/community/vmtn/cto/high-performance/blog/2011/09/19/vnuma-what-it-is-and-why-it-matters vNUMA what it is and why it matters], VMware presents vNUMA as part of vSphere, a virtualization platform oriented to build cloud computing frameworks.&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
*[http://www.sgi.com/pdfs/4250.pdf Performance and Productivity Breakthroughs with Very Large Coherent Shared Memory: The SGI® UV Architecture] SGI white paper.&lt;br /&gt;
*[http://www.scalemp.com/architecture#2 Versatile SMP (vSMP) Architecture]&lt;br /&gt;
*Adrian Moga, Michel Dubois, [http://www.sciencedirect.com/science/article/pii/S1383762108001136 &amp;quot;A comparative evaluation of hybrid distributed shared-memory systems,&amp;quot;] Journal of Systems Architecture, Volume 55, Issue 1, January 2009, Pages 43-52&lt;br /&gt;
*Jinbing Peng; Xiang Long; Limin Xiao; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=4708970&amp;amp;isnumber=4708921 &amp;quot;DVMM: A Distributed VMM for Supporting Single System Image on Clusters,&amp;quot;] Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for , vol., no., pp.183-188, 18-21 Nov. 2008&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
*Shan, H.; Singh, J.P.; Oliker, L.; Biswas, R.; , [http://escholarship.org/uc/item/76p9b40g#page-1 &amp;quot;Message passing vs. shared address space on a cluster of SMPs,&amp;quot;] Parallel and Distributed Processing Symposium., Proceedings 15th International , vol., no., pp.8 pp., Apr 2001&lt;br /&gt;
*Protic, J.; Tomasevic, M.; Milutinovic, V.; , [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 &amp;quot;Distributed shared memory: concepts and systems,&amp;quot;] Parallel &amp;amp; Distributed Technology: Systems &amp;amp; Applications, IEEE , vol.4, no.2, pp.63-71, Summer 1996&lt;br /&gt;
*Chandola, V. , [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.5206&amp;amp;rep=rep1&amp;amp;type=pdf &amp;quot;Design Issues in Implementation of Distributed Shared Memory in User Space,&amp;quot;]&lt;br /&gt;
*Nitzberg, B.; Lo, V. , [http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf &amp;quot;Distributed Shared Memory:  A Survey of Issues and Algorithms&amp;quot;]&lt;br /&gt;
*Steven Cameron Woo , Moriyoshi Ohara , Evan Torrie , Jaswinder Pal Singh , Anoop Gupta, [http://dl.acm.org/citation.cfm?id=223990 &amp;quot;The SPLASH-2 programs: characterization and methodological considerations,&amp;quot;] Proceedings of the 22nd annual international symposium on Computer architecture, p.24-36, June 22-24, 1995, S. Margherita Ligure, Italy&lt;br /&gt;
*Jegou, Y. , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Implementation of page management in Mome, a user-level DSM] Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on, p.479-486, 21 May 2003, IRISA/INRIA, France&lt;br /&gt;
*Hennessy, J.; Heinrich, M.; Gupta, A., [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=747863 &amp;quot;Cache-Coherent Distributed Shared Memory:  Perspectives on Its Development and Future Challenges,&amp;quot;] Proceedings of the IEEE, Volume: 87 Issue:3, pp.418 - 429, Mar 1999, Comput. Syst. Lab., Stanford Univ., CA&lt;br /&gt;
*J. Protic, M. Tomasevic, and V. Milutinovic, [http://media.wiley.com/product_data/excerpt/76/08186773/0818677376-2.pdf An Overview of Distributed Shared Memory]&lt;br /&gt;
*Yoon, M.; Malek, M.; , [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf &amp;quot;Configurable Shared Virtual Memory for Parallel Computing&amp;quot;] University of Texas Technical Report tr94-21, July 15 1994, Department of Electrical and Computer Engineering, The University of Texas at Austin&lt;br /&gt;
*Dubnicki, C.;   Iftode, L.;   Felten, E.W.;   Kai Li;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=508084 &amp;quot;Software Support for Virtual Memory-Mapped Communication&amp;quot;] Parallel Processing Symposium, 1996., Proceedings of IPPS '96, The 10th International, pp.372 - 381, 15-19 Apr 1996, Dept. of Comput. Sci., Princeton Univ., NJ &lt;br /&gt;
*Dubnicki, C.;   Bilas, A.;   Li, K.;   Philbin, J.;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=580931 &amp;quot;Design and Implementation of Virtual Memory-Mapped Communication on Myrinet&amp;quot;] Parallel Processing Symposium, 1997. Proceedings., 11th International, pp.388 - 396, 1-5 Apr 1997, Princeton Univ., NJ&lt;br /&gt;
*Kranz, D.; Johnson, K.; Agarwal, A.; Kubiatowicz, J.; Lim, B.; , [http://delivery.acm.org.prox.lib.ncsu.edu/10.1145/160000/155338/p54-kranz.pdf?ip=152.1.24.251&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=81831968&amp;amp;CFTOKEN=62928147&amp;amp;__acm__=1327853249_029db7e958cb50bd47056f939f3296f7 &amp;quot;Integrating Message-Passing and Shared-Memory:  Early Experience&amp;quot;] PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp.54 - 63&lt;br /&gt;
*Amza, C.;  Cox, A.L.;  Dwarkadas, S.;  Keleher, P.;  Honghui Lu;  Rajamony, R.;  Weimin Yu;  Zwaenepoel, W.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 &amp;quot;TreadMarks: shared memory computing on networks of workstations&amp;quot;] Computer, Volume: 29 Issue:2, pp. 18 - 28, Feb 1996, Dept. of Comput. Sci., Rice Univ., Houston, TX&lt;br /&gt;
*David R. Cheriton. 1985. [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems.] SIGOPS Oper. Syst. Rev. 19, 4 (October 1985), 26-33.&lt;br /&gt;
*Matthew Chapman and Gernot Heiser. 2009. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html vNUMA: a virtual shared-memory multiprocessor.] In Proceedings of the 2009 conference on USENIX Annual technical conference (USENIX'09). USENIX Association, Berkeley, CA, USA, 2-2.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
The memory hierarchy described for the CSVS system places remote memories:&lt;br /&gt;
# Between main memory and local disk storage&lt;br /&gt;
# Same hierarchy as local disk storage&lt;br /&gt;
# Below local disk storage&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When messages are sent by the OS to retrieve non-local data the virtual address of the retrieved data is translated to physical:&lt;br /&gt;
# At the origin of the message, i.e. where the page fault occurs&lt;br /&gt;
# By the DSM system default manager&lt;br /&gt;
# At the location where the desired page resides&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
DSM nodes&lt;br /&gt;
# partially map variable amounts of their memory to the distributed address space&lt;br /&gt;
# are configured to supply a contiguous and fixed amount of memory to the distributed address space&lt;br /&gt;
# utilize I/O to access the entirely non-local distributed address space&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SAS programming model:&lt;br /&gt;
# Has evolved beyond MP as it is difficult to program in scalable DSM environments&lt;br /&gt;
# Utilizes MP to communicate but relies on the ease of a common address space&lt;br /&gt;
# Has suffered too many security problems, scalable MP now dominates the landscape&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Page management in MOME:&lt;br /&gt;
# Requires consistent address space mapping across all nodes&lt;br /&gt;
# Is managed from a global DSM perspective&lt;br /&gt;
# Allows an F and V page descriptor to occur for the same page on the same node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The most adopted DSM algorithm is:&lt;br /&gt;
# Single Reader/ Single Writer (SRSW)&lt;br /&gt;
# Multiple Readers/ Single Writer (MRSW)&lt;br /&gt;
# Multiple Readers/Multiple Writers (MRMW)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Sequential Consistency:&lt;br /&gt;
# all processors see the same ordering of memory references, and these are issued in sequences by the individual processors&lt;br /&gt;
# the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence&lt;br /&gt;
# consistency is required only on synchronization accesses&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
SPLASH is a&lt;br /&gt;
# coherence protocol&lt;br /&gt;
# collection of parallel programs engineered for the evaluation of shared address space machines&lt;br /&gt;
# DSM implementation&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The complexity of development is &lt;br /&gt;
# the same for MP and SAS&lt;br /&gt;
# lower for SAS&lt;br /&gt;
# lower for MP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
vNUMA is a&lt;br /&gt;
# Fast Fourier Transform implementation&lt;br /&gt;
# network implementation&lt;br /&gt;
# virtual machine that presents a cluster as a virtual shared-memory multiprocessor&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72701</id>
		<title>CSC/ECE 506 Spring 2013/4a ss</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72701"/>
		<updated>2013-02-13T22:19:47Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Dsm.jpg|300px|thumb|left|SCD's IBM SP system blackforest, a distributed shared memory ('''DSM''') system]]&lt;br /&gt;
&lt;br /&gt;
== SAS programming on distributed-memory machines ==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Shared_memory '''Shared Address Space'''] (SAS) programming on distributed memory machines is a programming abstraction that requires less development effort than the traditional method of [http://en.wikipedia.org/wiki/Message_passing '''Message Passing'''] (MP) on such machines. Clusters of servers are a typical example of where SAS programming is applied.  Distributed systems are groups of computers that communicate through a network and share a common work goal.  Distributed systems typically do not physically share the same memory (are not [http://en.wikipedia.org/wiki/Coupling_%28computer_programming%29 '''tightly coupled''']); instead, each processor or group of processors must depend on mechanisms other than direct memory access in order to communicate.  Some of the issues that arise include [http://en.wikipedia.org/wiki/Memory_coherence '''memory coherence'''], types of memory access, data and process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''], and performance.&lt;br /&gt;
&lt;br /&gt;
=== Origins ===&lt;br /&gt;
Distributed memory systems started to flourish in the 1980s. Increasing processor performance and network connectivity offered the perfect environment for parallel processing over a network of computers. This was a cheap way to put together massive computing power. The main drawback was the move from sequential programs written for local memory to parallel programs that share memory. This is where SAS simplified programming, by hiding the mechanisms used to access distant memory located in other computers of the cluster.&lt;br /&gt;
&lt;br /&gt;
In 1985, Cheriton, in his article [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 &amp;quot;Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems&amp;quot;], introduced ideas for the application of shared memory techniques in distributed memory systems. Cheriton envisioned a system of nodes with a pool of shared memory with a common file namespace that could &amp;quot;decentralize the implementation of a service.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:Dsmd.jpg|400px|thumb|right|Distributed Shared Memory]]&lt;br /&gt;
Early distributed computer systems relied almost exclusively on message passing in order to communicate with one another, and this technique is still widely used today.  In a message passing model, each processor's local memory can be considered as isolated from that of the rest of the system.  Processes or objects can send or receive messages in order to communicate, and this can occur in a synchronous or asynchronous manner.  In distributed systems, and particularly with certain types of programs, the message passing model can become overly burdensome to the programmer, as tracking data movement and maintaining data integrity can become quite challenging with many control threads.  A shared address or shared-memory system, however, can provide a programming model that simplifies data sharing via uniform mechanisms of data structure reads and writes on common memory.  Current distributed systems seek to take advantage of both SAS and MP programming model principles in hybrid systems. &lt;br /&gt;
&lt;br /&gt;
=== Distributed Shared Memory (DSM) ===&lt;br /&gt;
Distributed shared memory is an architectural approach designed to overcome the scaling limitations of symmetric shared memory multiprocessors while retaining a shared memory model for communication and programming. Generally a distributed system consists of a set of nodes connected by an interconnection network.  Nodes may consist of individual processors or a multiprocessor system (e.g., [http://en.wikipedia.org/wiki/Symmetric_multiprocessing '''Symmetric Multiprocessor'''] (SMP)), the latter typically sharing a system bus.  Each node contains a local memory, which maps partially to the distributed address space.  &lt;br /&gt;
Regardless of the system topology (bus, ring, mesh), a specific interconnection in each cluster must connect it to the system. Information about states and current locations of particular data blocks usually resides in a system table or directory. Directory organization varies from full map storage to different dynamic organizations.&lt;br /&gt;
Three issues arise when accessing data in the DSM address space while keeping that data consistent:&lt;br /&gt;
# '''DSM algorithm''' - which algorithm is used to access data. Commonly used strategies are replication and migration. Replication allows multiple copies of the same data item to reside in different local memories. Migration implies that only a single copy of a data item exists at any one time, so the data item must be moved to the requesting site for exclusive use.&lt;br /&gt;
# '''Implementation level''' - a lookup should first determine whether the requested data is in local memory; if it is not, the system must bring the data into local memory. This can be done in software, in hardware, or in both. The choice of implementation depends on price/performance trade-offs.&lt;br /&gt;
# '''Memory consistency model''' - the behavior of memory with respect to read and write operations from multiple processors must be governed by an appropriate memory consistency model.&lt;br /&gt;
&lt;br /&gt;
=== Cache-Coherent DSM ===&lt;br /&gt;
&lt;br /&gt;
The cache coherence problem arises when different processors cache and update values of the same memory location. Early DSM systems implemented a shared address space where the amount of time required to access a piece of data was &lt;br /&gt;
related to its location.  These systems became known as [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access '''Non-Uniform Memory Access'''] (NUMA), whereas an SMP type&lt;br /&gt;
system is known as [http://en.wikipedia.org/wiki/Uniform_Memory_Access '''Uniform Memory Access'''] (UMA) architecture.  NUMA architectures were difficult to program due &lt;br /&gt;
to potentially significant differences in data access times. SMP architectures dealt with this problem through caching. &lt;br /&gt;
Protocols were established that ensured that, prior to writing a location, all other copies of the location (in other caches)&lt;br /&gt;
were invalidated. Thus, the system allows multiple copies of a memory location to exist when&lt;br /&gt;
it is being read, but only one copy when it is being written. &lt;br /&gt;
&lt;br /&gt;
Cache-coherent DSM architectures rely on directory-based [http://en.wikipedia.org/wiki/Cache_coherency '''cache coherence'''], where an extra directory structure keeps track&lt;br /&gt;
of all blocks that have been cached by each processor. Since the directory tracks which caches have copies&lt;br /&gt;
of any given memory block, a coherence protocol can use&lt;br /&gt;
the directory, together with state and other information kept for each cached block, to maintain a consistent view of &lt;br /&gt;
memory. A simple cache coherence&lt;br /&gt;
protocol can operate with three states for each cache block. These states are Invalid,&lt;br /&gt;
Shared, and Exclusive.  Furthermore, in a cache-coherent DSM machine, the directory is distributed across the nodes so that each directory entry resides&lt;br /&gt;
with the memory block it describes in that node's physical local memory.&lt;br /&gt;
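&lt;br /&gt;
The three-state scheme described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the protocol of any particular machine: the ''Directory'' class, its method names, and the restriction to a single memory block are all hypothetical choices made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    EXCLUSIVE = "E"


class Directory:
    """Tracks, for one memory block, which caches hold a copy (hypothetical sketch)."""

    def __init__(self, num_caches):
        # Every cache starts without a copy of the block.
        self.states = [State.INVALID] * num_caches

    def read(self, cache_id):
        # A read downgrades any other exclusive owner to Shared,
        # then records the reader as a sharer.
        for other, state in enumerate(self.states):
            if state is State.EXCLUSIVE and other != cache_id:
                self.states[other] = State.SHARED
        if self.states[cache_id] is State.INVALID:
            self.states[cache_id] = State.SHARED

    def write(self, cache_id):
        # A write invalidates every other copy before granting exclusive access.
        for other in range(len(self.states)):
            if other != cache_id:
                self.states[other] = State.INVALID
        self.states[cache_id] = State.EXCLUSIVE

    def sharers(self):
        # Caches that currently hold a valid copy of the block.
        return [i for i, s in enumerate(self.states) if s is not State.INVALID]


directory = Directory(num_caches=3)
directory.read(0)
directory.read(1)
readers = directory.sharers()      # caches 0 and 1 hold Shared copies
directory.write(2)
after_write = directory.sharers()  # only cache 2 still holds a copy
```
&lt;br /&gt;
The sketch shows the property stated above: several copies may coexist while the block is only read, but a write leaves exactly one valid copy.&lt;br /&gt;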
&lt;br /&gt;
=== Page Management and memory mapping in Mome ===&lt;br /&gt;
[[File:Untitled_Project.jpg|350px|thumb|left|Memory Mapping in Mome]]&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Mome] is described by its developers as a user-level distributed shared memory.  Mome, in 2003, was a run-time model that mapped Mome segments onto node private address space.  &lt;br /&gt;
&lt;br /&gt;
===== Mome Segment creation =====&lt;br /&gt;
&lt;br /&gt;
Segment creation was initiated through a ''MomeCreateSegment(size)'' call which returned an identifier for mapping used by all nodes.  Any process can request a mapping of a section of its local memory to a Mome segment section by calling ''MomeMap(Addr, Lg, Prot, Flags, Seg, Offset)'', which returns the starting address of the mapped region.  Each mapping request made by a process is independent and the addresses of the mappings may or may not be consistent on all nodes.  If mappings are consistent between processes, however, then pointers may be shared by them.  Mome supports strong and weak consistency models, and for any particular page each node is able to dynamically manage its consistency during program execution.&lt;br /&gt;
&lt;br /&gt;
===== Page Management in Mome =====&lt;br /&gt;
&lt;br /&gt;
Mome manages [http://en.wikipedia.org/wiki/Page_%28computer_memory%29 '''pages'''] in a directory-based scheme where each page directory maintains the status of six characteristics per page on each node.  The page manager acts upon collections of nodes according to these characteristics for each page:  &lt;br /&gt;
V nodes possess the current version, M nodes have a modified version, S nodes want strong consistency, I nodes are &lt;br /&gt;
invalidated, F nodes have initiated a modification merge and H nodes are a special type of hidden page.  A new version of &lt;br /&gt;
a page is created prior to a constraint violation and before modifications are integrated as a result of a consistency&lt;br /&gt;
request.  &lt;br /&gt;
&lt;br /&gt;
===== Memory mapping in Mome =====&lt;br /&gt;
&lt;br /&gt;
The Mome memory mapping figure to the left shows a possible DSM memory organization on a single node.  The DSM memory size&lt;br /&gt;
shown is 22 pages.  When a new segment is created on a node a segment descriptor is created on that node.  In this case the&lt;br /&gt;
segment descriptor is 12 pages, with each segment descriptor block corresponding to one page.  Each block also contains&lt;br /&gt;
three DSM memory references for current, modified and next version of pages.  The memory organization state shows an &lt;br /&gt;
application with two mappings, M1 and M2, with segment offsets at 0 and 8.  The six pages of M1 are managed by segment &lt;br /&gt;
descriptor blocks 0 to 5.  The descriptor blocks (and application memory) show that pages 1, 2 and 5 have no associated &lt;br /&gt;
memory, while M1 page 0 is mapped to block 6 as a current version and M1 page 3 is mapped to block 13 as a current version,&lt;br /&gt;
block 8 as a modified version, and has initiated a modifications merge as indicated by the block 17 pointer.  The communication&lt;br /&gt;
layer manages incoming messages from other nodes. &lt;br /&gt;
&lt;br /&gt;
[[File:Mem hierarchy.png|200px|thumb|right|Memory hierarchy of node]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Node Communication ===&lt;br /&gt;
As described by [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf Yoon, et al.] in 1994, the communication paradigm &lt;br /&gt;
for their DSM node network relies on a memory hierarchy for each node that places remote memories at the same level of the hierarchy as &lt;br /&gt;
its own local disk storage.  [http://en.wikipedia.org/wiki/Page_fault '''Page faults'''] within a given node that can be resolved within disk storage are handled &lt;br /&gt;
normally while those that cannot are resolved between node main memory and memory of other nodes. Point to point &lt;br /&gt;
communication at the node level is supported through message passing, and the specific mechanism for communication is&lt;br /&gt;
agreed to by all nodes. &lt;br /&gt;
&lt;br /&gt;
Yoon describes a DSM system that generates a shared virtual memory on a per-job basis.  A '''configurable shared virtual address space'''&lt;br /&gt;
(CSVS) is prepared when a member node receives a job, generates a job identification number, and creates&lt;br /&gt;
an information table in its memory:&lt;br /&gt;
&lt;br /&gt;
                                     ''JOB_INFORMATION {''&lt;br /&gt;
                                         ''status;''&lt;br /&gt;
                                         ''number_of_tasks;''&lt;br /&gt;
                                         ''number_of_completed_tasks;''&lt;br /&gt;
                                         ''*member_list;''                    /*pointer to first member*/&lt;br /&gt;
                                         ''number_of_members;''&lt;br /&gt;
                                         ''IO_server;''&lt;br /&gt;
                                     ''}''&lt;br /&gt;
&lt;br /&gt;
The ''status'' refers to the creation of the CSVS and ''number_of_members'' and ''member_list'' are established through&lt;br /&gt;
a task distribution process during address space assignment.  All tasks associated with the program are tagged with&lt;br /&gt;
the ''job_id'' and ''requester_id'' and, following address space assignment, are distributed across the system.  The&lt;br /&gt;
actual CSVS creation occurs when the first task of a job is initiated by a member, which asks all other members to generate the new&lt;br /&gt;
CSVS.  Subspace assignment for the SAS model ensues under the specific ''job_id''.&lt;br /&gt;
&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Operating_system '''operating system'''] (OS) or [http://en.wikipedia.org/wiki/Memory_management_unit '''memory management unit'''] (MMU) of each member maintains a copy of the ''JOB_INFORMATION'' &lt;br /&gt;
table which is consulted to identify the default manager when a page fault occurs.  When a page fault does occur, the MMU&lt;br /&gt;
locates the default manager and handles the fault normally.  If the page requested is out of its subspace then the &lt;br /&gt;
virtual address, ''job_id'', and default manager identification are sent to the [http://en.wikipedia.org/wiki/Control_unit '''control unit'''] (CU) to construct a &lt;br /&gt;
message requesting a page copy.  All messages sent through the CSVS must include a virtual address and the ''job_id'',&lt;br /&gt;
which acts as protection to control access to relevant memory locations.  When received at the appropriate member&lt;br /&gt;
node, the virtual address is translated to a local physical address.&lt;br /&gt;
[[File:Jacobi_code.jpg|300px|thumb|left|Jacobi method pseudocode using TreadMarks API]]&lt;br /&gt;
&lt;br /&gt;
=====Improvements in communication=====&lt;br /&gt;
Early SAS programming models in DSM environments suffered from poor performance because protection schemes required&lt;br /&gt;
applications to access the network via system calls, significantly increasing latency.  Later software&lt;br /&gt;
systems and network interfaces arose that were able to ensure safety without incurring the time cost of the system calls.  Addressing this and other &lt;br /&gt;
latency sources on both ends of communication was an important goal for projects such as the&lt;br /&gt;
'''Virtual memory-mapped communication''' (VMMC) model that was developed as part of the [http://shrimp.cs.princeton.edu/index.html Shrimp Project]. &lt;br /&gt;
&lt;br /&gt;
Protection is achieved in VMMC because the receiver must grant permission before the sender is allowed to transfer data&lt;br /&gt;
to a receiver-defined area of its address space.  In this communication scheme, the receiver process exports areas of its&lt;br /&gt;
address space that will act as receive buffers and sending processes must import the destinations.  There is no explicit &lt;br /&gt;
receive operation in VMMC.  Receivers are able&lt;br /&gt;
to define which senders can import specific buffers and VMMC ensures only receiver buffer space is overwritten.  Imported&lt;br /&gt;
receive buffers are mapped to a destination proxy space which can be implemented as part of the sender's virtual address&lt;br /&gt;
space and can be translated by VMMC to a receiver, process and memory address.  VMMC supports a deliberate update&lt;br /&gt;
request and will update data sent previously to an imported receive buffer.  This transfer occurs directly without receiver&lt;br /&gt;
CPU interruption.&lt;br /&gt;
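&lt;br /&gt;
The export/import permission scheme described above can be illustrated with a small Python model. It is a hedged sketch only and does not reproduce the real VMMC interface: the class names, the method names (''export'', ''import_buffer'', ''send'') and the node identifiers are hypothetical, and ordinary in-process memory stands in for the network.&lt;br /&gt;
&lt;br /&gt;
```python
class ReceiveBuffer:
    """A receiver-exported region of memory (hypothetical model, not the VMMC API)."""

    def __init__(self, size, allowed_senders):
        self.data = bytearray(size)
        self.allowed_senders = set(allowed_senders)


class Receiver:
    def __init__(self):
        self.exports = {}

    def export(self, name, size, allowed_senders):
        # The receiver decides which senders may import this buffer.
        self.exports[name] = ReceiveBuffer(size, allowed_senders)


class Sender:
    def __init__(self, sender_id):
        self.sender_id = sender_id
        self.proxies = {}  # stand-in for the destination proxy space

    def import_buffer(self, receiver, name):
        buffer = receiver.exports[name]
        if self.sender_id not in buffer.allowed_senders:
            raise PermissionError("sender may not import this buffer")
        self.proxies[name] = buffer

    def send(self, name, offset, payload):
        # Data lands directly in the exported buffer; there is no receive call.
        buffer = self.proxies[name]
        valid_offsets = range(len(buffer.data) - len(payload) + 1)
        if offset not in valid_offsets:
            raise ValueError("transfer would write outside the receive buffer")
        buffer.data[offset:offset + len(payload)] = payload


receiver = Receiver()
receiver.export("inbox", size=8, allowed_senders={"node_a"})

sender = Sender("node_a")
sender.import_buffer(receiver, "inbox")
sender.send("inbox", 0, b"hi")
```
&lt;br /&gt;
In this model a sender that was not named when the buffer was exported cannot import it, and a transfer that does not fit inside the buffer is rejected, mirroring the two protection properties stated above.&lt;br /&gt;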
&lt;br /&gt;
[[File:Shortest_path_pseudocode.jpg|300px|thumb|right|Shortest path pseudocode using TreadMarks API]]&lt;br /&gt;
=== Programming Environment ===&lt;br /&gt;
The globally shared memory abstraction provided through virtual memory or some other DSM mechanism allows programmers &lt;br /&gt;
to focus on algorithms instead of processor communication and data tracking.  Many programming environments have been&lt;br /&gt;
developed for DSM systems including Rice University's [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]&lt;br /&gt;
in the 1990s.  TreadMarks was a user-level library that ran on top of Unix.  Programs were written in&lt;br /&gt;
C, C++ or Fortran and then compiled and linked with the TreadMarks library.  &lt;br /&gt;
&lt;br /&gt;
Shown at left is a pseudocode example of using the TreadMarks API to implement the Jacobi method, a type of partial &lt;br /&gt;
differential equation solver.  The code iterates over a 2D array and updates each element to the average of its four&lt;br /&gt;
nearest neighbors.  All processors are assigned an approximately equivalent number of rows and neighboring processes &lt;br /&gt;
share boundary rows as is necessary for the calculation.  This example shows TreadMarks' use of [http://en.wikipedia.org/wiki/Barrier_%28computer_science%29 '''barriers'''], a technique used for process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''].  Barriers prevent race&lt;br /&gt;
conditions.  ''void Tmk_startup(int argc, char **argv)'' initializes TreadMarks and starts the remote processes.  &lt;br /&gt;
The ''void Tmk_barrier(unsigned id)'' call blocks the calling process until every other process arrives at the barrier.  In this&lt;br /&gt;
example, ''Tmk_barrier(0)'' guarantees that process 0 completes initialization before any process proceeds, ''Tmk_barrier(1)'' &lt;br /&gt;
guarantees all previous iteration values are read before any current iteration values are written, and ''Tmk_barrier(2)''&lt;br /&gt;
guarantees all current iteration values are written before any next iteration computation begins.&lt;br /&gt;
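&lt;br /&gt;
The read-then-write barrier pattern described above can be sketched with Python threads. This is an illustrative assumption-laden stand-in, not TreadMarks code: ''threading.Barrier'' plays the role of the barrier call, threads play the role of processes, and the grid size, worker count and boundary values are made up for the example.&lt;br /&gt;
&lt;br /&gt;
```python
import threading

ROWS, COLS, WORKERS, ITERATIONS = 6, 6, 2, 10

# Shared grid: the top boundary row is held at 1.0, everything else starts at 0.0.
grid = [[0.0] * COLS for _ in range(ROWS)]
grid[0] = [1.0] * COLS
barrier = threading.Barrier(WORKERS)


def worker(worker_id):
    # Each worker owns a contiguous band of interior rows.
    interior = list(range(1, ROWS - 1))
    per_worker = len(interior) // WORKERS
    band = interior[worker_id * per_worker:(worker_id + 1) * per_worker]
    for _ in range(ITERATIONS):
        # Read phase: average the four neighbours into a private scratch area.
        scratch = {}
        for r in band:
            for c in range(1, COLS - 1):
                scratch[(r, c)] = (grid[r - 1][c] + grid[r + 1][c]
                                   + grid[r][c - 1] + grid[r][c + 1]) / 4.0
        barrier.wait()  # all previous values are read before anyone writes
        for (r, c), value in scratch.items():
            grid[r][c] = value
        barrier.wait()  # all new values are written before the next iteration


threads = [threading.Thread(target=worker, args=(i,)) for i in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
&lt;br /&gt;
Because neighbouring bands share boundary rows, removing either barrier would let one worker read a row that another worker is in the middle of updating.&lt;br /&gt;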
&lt;br /&gt;
To the right is shown a short pseudocode program exemplifying another SAS synchronization technique which uses [http://en.wikipedia.org/wiki/Lock_%28computer_science%29 '''locks'''].  This program calculates the shortest path in a grouping of nodes that starts at any designated start node, visits each&lt;br /&gt;
other node once and returns to the origin node.  The shortest route identified thus far is stored in the shared ''Shortest_length''&lt;br /&gt;
and investigated routes are kept in a queue, most promising at the front, and expanded one node at a time.  A process&lt;br /&gt;
compares its resulting shortest partial path with ''Shortest_length'', updating if necessary and returns to the queue&lt;br /&gt;
to continue its search.  Process 0 allocates the shared queue and minimum length.  Exclusive access must be established&lt;br /&gt;
and maintained to ensure correctness and this is achieved through a lock on the queue and ''Shortest_length''.  Each&lt;br /&gt;
process acquires the queue lock to identify a promising partial path and releases it upon finding one.  When &lt;br /&gt;
updating ''Shortest_length'', a lock is acquired to ensure [http://en.wikipedia.org/wiki/Mutual_exclusion '''mutual exclusion'''] while this shared data is modified as well.&lt;br /&gt;
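&lt;br /&gt;
The locking discipline described above can be sketched as follows. This is a minimal illustration using Python threads and ''threading.Lock'' rather than the TreadMarks lock calls; the distance matrix, the variable names and the number of workers are hypothetical values chosen for the example.&lt;br /&gt;
&lt;br /&gt;
```python
import heapq
import threading

# Hypothetical symmetric distance matrix for four nodes.
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
N = len(DIST)

queue = [(0, [0])]                       # shared queue of (partial length, partial tour)
queue_lock = threading.Lock()
shortest = {"length": float("inf")}      # shared best complete tour length
shortest_lock = threading.Lock()


def worker():
    while True:
        with queue_lock:                 # exclusive access while taking a partial tour
            if not queue:
                return
            length, tour = heapq.heappop(queue)
        with shortest_lock:
            bound = shortest["length"]
        if length >= bound:
            continue                     # prune: cannot beat the best tour found so far
        if len(tour) == N:
            total = length + DIST[tour[-1]][0]   # close the tour at the origin
            with shortest_lock:          # mutual exclusion on the shared minimum
                shortest["length"] = min(shortest["length"], total)
            continue
        children = [(length + DIST[tour[-1]][city], tour + [city])
                    for city in range(N) if city not in tour]
        with queue_lock:
            for child in children:
                heapq.heappush(queue, child)


threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
&lt;br /&gt;
Each lock is held only for the short section that touches the shared structure, so workers expand partial tours in parallel and serialize only on the queue and on the shared minimum.&lt;br /&gt;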
&lt;br /&gt;
=== DSM Implementations ===&lt;br /&gt;
From an architectural point of view, DSMs are composed of several nodes connected via a network. Each of the nodes can be an individual machine or a cluster of machines. Each node has local memory modules that are partly or entirely part of the shared memory. There are many characteristics that can be used to classify DSM implementations. The most obvious one is the nature of the implementation: software, hardware, or hybrid. This historical classification has been extracted from [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 Distributed shared memory: concepts and systems].&lt;br /&gt;
&lt;br /&gt;
Software DSM implementations are those built with user-level software, the OS, a programming language, or some combination of these. &lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Implementation'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Implementation / Cluster configuration'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Network'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Algorithm'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Consistency Model'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Granularity Unit'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Coherence Policy'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''SW/HW/Hybrid'''&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/vm-and-gc/ivy-shared-virtual-memory-li-icpp-1988.pdf IVY]||User-level library + OS modification || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||1 Kbyte ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/121133.121159 Munin]||Runtime system + linker + library + preprocessor + OS modifications ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||Type-specific (SRSW, MRSW, MRMW) ||Release ||Variable size objects ||Type-specific (delayed update, invalidate) ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]||User-level || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRMW ||Lazy release ||4 Kbytes ||Update, Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/74851.74871 Mirage]||OS kernel ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||MRSW ||Sequential ||512 bytes ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://onlinelibrary.wiley.com.prox.lib.ncsu.edu/doi/10.1002/spe.4380210503/pdf Clouds]||OS, out of kernel || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Inconsistent, sequential ||8 Kbytes ||Discard segment when unlocked ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1663305&amp;amp;tag=1 Linda]||Language || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||Variable (tuple size) ||Implementation- dependent ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=93183 Memnet]||Single processor, Memnet device||Token ring||MRSW||Sequential||32 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Scalable_Coherent_Interface SCI]||Arbitrary||Arbitrary||MRSW||Sequential||16 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.8026&amp;amp;rep=rep1&amp;amp;type=pdf KSR1]||64-bit custom PE, I+D caches, 32M local memory||Ring-based||MRSW||Sequential||128 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=766965&amp;amp;isnumber=16621 RMS]||1-4 processors, caches, 256M local memory||RM bus||MRMW||Processor||4 bytes||Update||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Alewife_(multiprocessor) Alewife]||Sparcle PE, 64K cache, 4M local memory, CMMU|| mesh||MRSW||Sequential||16 Kbytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://dl.acm.org/citation.cfm?id=192056 Flash]||MIPS T5, I +D  caches, Magic controller|| mesh||MRSW||Release||128 Kbytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s08/papers/10_tempest.pdf Typhoon]||SuperSparc, 2-L caches|| NP controller||MRSW||Custom||32 Kbytes||Invalidate custom||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://shrimp.cs.princeton.edu/ Shrimp]||16 Pentium PC nodes|| Intel Paragon routing network||MRMW||AURC, scope||4 Kbytes||Update/Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Below is an explanation of the main characteristics listed in the DSM classification.&lt;br /&gt;
 &lt;br /&gt;
There are three types of DSM algorithm: &lt;br /&gt;
* '''Single Reader/ Single Writer''' (SRSW) &lt;br /&gt;
** central server algorithm - produces long network delays &lt;br /&gt;
** migration algorithm - produces thrashing and false sharing&lt;br /&gt;
* '''Multiple Readers/ Single Writer''' (MRSW) - read replication algorithm. It uses write invalidate. MRSW is the most adopted algorithm.&lt;br /&gt;
* '''Multiple Readers/Multiple Writers''' (MRMW) - full replication algorithm.  It has full concurrency and uses atomic updates.&lt;br /&gt;
&lt;br /&gt;
The consistency model plays a fundamental role in DSM systems. Due to the distributed nature of these systems, each consistency model places different constraints on memory accesses. &amp;quot;A memory consistency model defines the legal ordering of memory references issued by some processor, as observed by other processors.&amp;quot; The stricter the consistency model, the higher the access times, but the simpler the programming. Some of the consistency model types are:&lt;br /&gt;
* Sequential consistency - all processors see the same ordering of memory references, and these are issued in sequences by the individual processors.&lt;br /&gt;
* Processor consistency - the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence. &lt;br /&gt;
* Weak consistency - consistency is required only on synchronization accesses.&lt;br /&gt;
* Release consistency - divides the synchronization accesses into acquire and release. Normal reads and writes for a particular node can be done only after all acquires for that node are finished. Similarly, releases can only be done after all writes and reads are finished. &lt;br /&gt;
* Lazy Release consistency - extends release consistency by propagating modifications to the shared data only on the acquire, and of those, only the ones related to critical sections.&lt;br /&gt;
* Entry consistency - synchronization is performed at the variable level. This increases programming labor but helps lower latency and traffic, as only the specific variable needs to be synchronized.&lt;br /&gt;
&lt;br /&gt;
Granularity refers to the unit of data blocks that are managed by the coherence protocols. The unit differs between hardware and software systems, as hardware systems tend to use smaller blocks than the virtual layer that manages the data in software systems. The problem with larger blocks is that the probability of contention is higher, even when the processors involved are not accessing the exact same piece of memory but merely different parts of the same block. This is known as false sharing, and it leads to thrashing (memory blocks keep being requested by processors and processors keep waiting for the same memory blocks).&lt;br /&gt;
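&lt;br /&gt;
The effect of block size on false sharing can be shown with simple address arithmetic. The sketch below is illustrative only: the helper names and the two example addresses are hypothetical, while the 4 Kbyte and 32 byte block sizes are taken from the classification table above.&lt;br /&gt;
&lt;br /&gt;
```python
def block_of(address, block_size):
    # Coherence is tracked per block, so an address is identified by its block number.
    return address // block_size


def falsely_shared(addr_a, addr_b, block_size):
    # Two distinct addresses conflict when they fall inside the same coherence block.
    return addr_a != addr_b and block_of(addr_a, block_size) == block_of(addr_b, block_size)


# Two unrelated variables, 256 bytes apart, written by different processors.
x_addr, y_addr = 4096, 4352

page_conflict = falsely_shared(x_addr, y_addr, block_size=4096)  # 4 Kbyte pages: True
line_conflict = falsely_shared(x_addr, y_addr, block_size=32)    # 32 byte blocks: False
```
&lt;br /&gt;
With page-sized blocks both variables live in block 1 and every write by one processor disturbs the other; with 32 byte blocks they fall in blocks 128 and 136 and no longer interfere.&lt;br /&gt;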
&lt;br /&gt;
Coherence policy regulates data replication. The coherence policy dictates whether data that is being written at one site should be invalidated or updated at the remote sites. Usually, systems with fine-grain coherence (byte/word) impose the update policy, whereas systems based on coarse-grain (page) coherence utilize the invalidate policy. In other parts of the literature this is also known as the coherence protocol, and the two types are known as write-invalidate and write-update. The write-invalidate protocol invalidates all copies except one before writing to it. In contrast, write-update keeps all copies up to date.&lt;br /&gt;
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
There are numerous studies of the performance of shared memory applications in distributed systems. The vast majority of them use a collection of programs named [http://dl.acm.org/citation.cfm?id=223990 SPLASH and SPLASH-2].&lt;br /&gt;
===== SPLASH and SPLASH-2 =====&lt;br /&gt;
The '''Stanford ParalleL Applications for Shared memory''' (SPLASH) is a collection of parallel programs engineered for the evaluation of shared address space machines. These programs have been used in research studies to measure and analyze different aspects of the DSM architectures emerging at the time. A subsequent suite of programs (SPLASH-2) grew out of the need to address the limitations of the SPLASH programs. SPLASH-2 covers a broader domain of scientific programs, makes use of improved algorithms, and pays more attention to the architecture of the underlying systems.&lt;br /&gt;
&lt;br /&gt;
Selected applications in the SPLASH-2 collections include:&lt;br /&gt;
*FFT: a '''Fast Fourier Transform''' implementation, in which the data is organized in source and destination matrices so that each processor stores a contiguous set of rows in its local memory. In this application all processors involved communicate with one another, sending data to each other, to perform a matrix transposition.&lt;br /&gt;
*Ocean: calculations of large scale ocean movements simulating eddy currents. For the purpose of calculations, it shows nearest-neighbor access patterns in a multi-grid formation as opposed to using a single grid.&lt;br /&gt;
*LU: decomposition of a matrix into the product of a lower triangular and an upper triangular matrix. LU exhibits a &amp;quot;one-to-many non-personalized communication&amp;quot;.&lt;br /&gt;
*Barnes: simulates the interaction of a group of particles over time steps. &lt;br /&gt;
*Radix sort: integer sorting algorithm. This algorithm implementation displays an example of communication among all the processors involved, and the nature of this communication presents irregular patterns.&lt;br /&gt;
&lt;br /&gt;
===== Case studies =====&lt;br /&gt;
In 2001, [http://escholarship.org/uc/item/76p9b40g#page-1 Shan et al.] presented a comparison of the performance and programming effort of MP versus SAS running on clusters of '''Symmetric Multiprocessors''' (SMPs). They highlighted the &amp;quot;automatic management and coherent replication&amp;quot; of the SAS programming model, which facilitates programming on these types of clusters. The study uses the MPI/Pro protocol for the MP programming model and the GeNIMA SVM protocol (a page-based shared virtual memory protocol) for SAS on a 32-processor system (a cluster of 8 machines, each a 4-way SMP). The subset of applications used includes regularly structured applications such as FFT, Ocean, and LU, contrasted with irregular ones such as Radix sort, among others.&lt;br /&gt;
&lt;br /&gt;
The complexity of development is represented by the number of lines of code per application, as shown in the table below. SAS complexity is significantly lower than that of MP, and the difference increases as applications become more irregular and dynamic in nature (MP needs almost twice as many lines for Radix).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Appl.&lt;br /&gt;
!FFT &lt;br /&gt;
!OCEAN &lt;br /&gt;
!LU &lt;br /&gt;
!RADIX &lt;br /&gt;
!SAMPLE &lt;br /&gt;
!N-BODY&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MPI ||222||4320||470||384||479||1371&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SAS ||210 ||2878 ||309||201||450||950&lt;br /&gt;
|}&lt;br /&gt;
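&lt;br /&gt;
As a rough illustration, the line counts in the table can be turned into MPI-to-SAS ratios to see where the two models diverge most. This is a minimal Python sketch; the names ''LINES_OF_CODE'', ''mpi_to_sas_ratio'' and ''most_divergent_app'' are illustrative and do not come from the study.&lt;br /&gt;

```python
# Lines of code per application as (MPI, SAS), copied from the table above.
LINES_OF_CODE = {
    "FFT": (222, 210),
    "OCEAN": (4320, 2878),
    "LU": (470, 309),
    "RADIX": (384, 201),
    "SAMPLE": (479, 450),
    "N-BODY": (1371, 950),
}


def mpi_to_sas_ratio(app):
    """Return how many MPI lines are needed per SAS line for one application."""
    mpi, sas = LINES_OF_CODE[app]
    return mpi / sas


def most_divergent_app():
    """Return the application with the largest MPI-to-SAS line ratio."""
    return max(LINES_OF_CODE, key=mpi_to_sas_ratio)


if __name__ == "__main__":
    for app in LINES_OF_CODE:
        print(f"{app}: {mpi_to_sas_ratio(app):.2f}")
    print("largest gap:", most_divergent_app())
```

With these figures the largest ratio belongs to Radix, the most irregular application in the table.&lt;br /&gt;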
&lt;br /&gt;
In terms of performance, the results indicated that SAS was only half as efficient as MP at exploiting parallelism for most of the applications in this study. The only application that showed similar performance for both models was LU. The authors concluded that the overhead of the SVM protocol for maintaining page coherence and synchronization was the main handicap of the otherwise easier SAS programming model.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2004, [http://dl.acm.org/citation.cfm?id=1006252 Iosevich and Schuster] performed a study on the choice of memory consistency model in a DSM. The two types of models under study were the '''sequential consistency''' (SC) model and the relaxed consistency model, in particular the '''home-based lazy release consistency''' (HLRC) protocol. The SC provides a less complex programming model, whereas the HLRC improves on running performance as it allows parallel memory access. Memory consistency models provide a specific set of rules for interfacing with memory.&lt;br /&gt;
&lt;br /&gt;
The authors used a [http://www.cs.uga.edu/~dkl/6730/Fall02/Readings/millipage.pdf multiview] technique to ensure an efficient implementation of SC with fine-grain access to memory. The main advantage of this technique is that, by mapping one physical region to several virtual regions, the system avoids fetching the whole content of the physical memory when there is a fault accessing a specific variable located in one of the virtual regions. One step further is the mechanism proposed by [http://onlinelibrary.wiley.com/doi/10.1002/spe.417/abstract Niv and Schuster] to change the granularity dynamically at runtime.&lt;br /&gt;
For SC with multiview (MV), only the set of page replicas needs to be tracked, whereas the tracking needed for HLRC is much more complex. The advantage of HLRC is that write faults on read-only pages are handled locally, so these operations have a lower cost.&lt;br /&gt;
&lt;br /&gt;
This table summarizes the characteristics of the benchmark applications used for the measurements. In the Synch column, B represents barriers and  L locks.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Application&lt;br /&gt;
! Input data set&lt;br /&gt;
! Shared memory&lt;br /&gt;
!Sharing granularity&lt;br /&gt;
! Synch&lt;br /&gt;
!Allocation pattern&lt;br /&gt;
|-&lt;br /&gt;
| Water-nsq|| 8000 molecules|| 5.35MB|| a molecule (672B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Water-sp|| 8000 molecules|| 10.15MB|| a molecule (680B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| LU|| 3072 × 3072|| 72.10MB|| block (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| FFT|| 2^20 numbers|| 48.25MB|| a row segment|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| TSP|| A graph of 32 cities|| 27.86MB|| a tour (276B)|| L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| SOR|| 2066 × 10240|| 80.73MB|| a row (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Barnes-sp|| 32768 bodies|| 41.21MB|| body fields (4-32B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Radix|| 10240000 keys|| 82.73MB|| an integer (4B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Volrend|| a file -head.den-|| 29.34MB|| a 4 × 4 box (4B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Ocean|| a 514 × 514 grid|| 94.75MB|| grid point (8B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| NBody|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|-&lt;br /&gt;
| NBodyW|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The authors found that the average speedup of the HLRC protocol is 5.97, and the average speedup of the SC(with MV) protocol is 4.5 if non-optimized allocation is used, and 6.2 if the granularity is changed dynamically. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2008, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.1319 Roy and Chaudhary] compared the communication requirements of three different page-based DSM systems ([http://www.cs.umd.edu/projects/cvm/ CVM], [http://www.cs.utah.edu/flux/quarks.html Quarks], and [http://ieeexplore.ieee.org/iel4/5737/15339/00709960.pdf?arnumber=709960 Strings]) that use virtual memory mechanisms to trap accesses to shared areas. Their study was also based on the SPLASH-2 suite of programs.&lt;br /&gt;
In their experiments, Quarks was running tasks on separate processes (it does not support more than one application thread per process) while CVM and Strings were run with multiple application threads.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Program &lt;br /&gt;
!CVM &lt;br /&gt;
!Quarks &lt;br /&gt;
!Strings&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| FFT||1290||2419||1894&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-c||135||-||485&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-n||385||2873||407&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| OCEAN-c||1955||15475||6676&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-n2||2253||38438||10032&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-sp||905||7568||1998&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MATMULT||290||1307||645&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SOR||247||7236||934&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The results indicate that Quarks generates a large number of messages in comparison with the other two systems, as can be observed in the table above. LU-c (the contiguous version of LU) does not have a result for Quarks, as it was not possible to obtain results for the number of tasks (16) used in the experiment, due to the excessive number of collisions. In general, due to the lock-related traffic, the performance of Quarks is quite low when compared to the other two systems for many of the application programs tested.&lt;br /&gt;
CVM improves on lock management by allowing out-of-order access through a centralized lock manager. This makes the locking times for CVM much smaller than for the others. &lt;br /&gt;
Nevertheless, the best performer is Strings. It outperforms the other two systems overall, but there is room for improvement here too, as locking times in Strings were observed to account for a large percentage of the overall computation time. &lt;br /&gt;
&lt;br /&gt;
The overall conclusion is that using multiple threads per node and optimized locking mechanisms (CVM-type) provides the best performance.&lt;br /&gt;
&lt;br /&gt;
=== Evolution ===&lt;br /&gt;
A more recent distributed shared memory system is vNUMA.   [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html Chapman and Heiser] describe vNUMA (where v stands for virtual and NUMA for [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non-Uniform Memory Access]) as &amp;quot;a virtual machine that presents a cluster as a virtual shared-memory multiprocessor.&amp;quot; The virtualization in vNUMA is a layer between the real CPUs that form the distributed shared memory system and the OS that runs on top of it, usually Linux. The DSM in vNUMA is part of the hypervisor, which is the part of a virtual system that maps guest virtual addresses into real physical ones. The guest virtual addresses are then mapped into virtual memory addresses through the guest OS. &lt;br /&gt;
&lt;br /&gt;
Unlike other virtual machines, which often run several guest OSes on top of one host OS on a single machine, vNUMA runs one OS on top of several physical machines. It uses software DSM techniques to present all the virtualized memory as a whole.&lt;br /&gt;
&lt;br /&gt;
The DSM in vNUMA is a single-writer/multiple-reader write-invalidate protocol with sequential consistency. It is based on the IVY DSM, but introduces several improvements to increase performance. The owner of a page can determine whether the page needs to be sent by looking at the copyset (the set of nodes that hold a copy of the page), which avoids several page faults, and the manager becomes the owner of the copyset as soon as it is part of it. There are two further improvements: ''incremental deterministic merging'' and ''write-update-plus (WU+)''. ''Incremental deterministic merging'' uses sequence numbers to ensure that a location is updated with the latest value and not with intermediate, out-of-order writes. ''Write-update-plus (WU+)'' enforces single-writer for pages on which atomic operations are performed. vNUMA dynamically changes from multiple-writer to single-writer when atomic operations are detected. &lt;br /&gt;
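&lt;br /&gt;
The role of sequence numbers in incremental deterministic merging can be sketched as follows. This is a simplified Python illustration and not vNUMA's actual implementation: the class ''MergedLocation'', the method ''apply_write'', the helper ''merge'' and the use of one integer sequence number per write are all assumptions made for the example.&lt;br /&gt;

```python
class MergedLocation:
    """One memory location that accepts writes tagged with sequence numbers."""

    def __init__(self, value=None):
        self.value = value
        self.last_seq = -1  # no write has been applied yet

    def apply_write(self, seq, value):
        """Apply a write only if it is newer than the last one applied.

        Returns True if the location was updated, False if the write was stale.
        """
        if seq > self.last_seq:
            self.last_seq = seq
            self.value = value
            return True
        return False


def merge(writes):
    """Apply (seq, value) pairs in arrival order and return the final value."""
    location = MergedLocation()
    for seq, value in writes:
        location.apply_write(seq, value)
    return location.value
```

Whatever order the writes arrive in, the location ends up holding the value with the highest sequence number, so every node that sees the same set of writes converges on the same result.&lt;br /&gt;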
&lt;br /&gt;
In [http://communities.vmware.com/community/vmtn/cto/high-performance/blog/2011/09/19/vnuma-what-it-is-and-why-it-matters vNUMA what it is and why it matters], VMware presents vNUMA as part of vSphere, a virtualization platform oriented to build cloud computing frameworks.&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
*[http://www.sgi.com/pdfs/4250.pdf Performance and Productivity Breakthroughs with Very Large Coherent Shared Memory: The SGI® UV Architecture] SGI white paper.&lt;br /&gt;
*[http://www.scalemp.com/architecture#2 Versatile SMP (vSMP) Architecture]&lt;br /&gt;
*Adrian Moga, Michel Dubois, [http://www.sciencedirect.com/science/article/pii/S1383762108001136 &amp;quot;A comparative evaluation of hybrid distributed shared-memory systems,&amp;quot;] Journal of Systems Architecture, Volume 55, Issue 1, January 2009, Pages 43-52&lt;br /&gt;
*Jinbing Peng; Xiang Long; Limin Xiao; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=4708970&amp;amp;isnumber=4708921 &amp;quot;DVMM: A Distributed VMM for Supporting Single System Image on Clusters,&amp;quot;] Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for , vol., no., pp.183-188, 18-21 Nov. 2008&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
*Shan, H.; Singh, J.P.; Oliker, L.; Biswas, R.; , [http://escholarship.org/uc/item/76p9b40g#page-1 &amp;quot;Message passing vs. shared address space on a cluster of SMPs,&amp;quot;] Parallel and Distributed Processing Symposium., Proceedings 15th International , vol., no., pp.8 pp., Apr 2001&lt;br /&gt;
*Protic, J.; Tomasevic, M.; Milutinovic, V.; , [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 &amp;quot;Distributed shared memory: concepts and systems,&amp;quot;] Parallel &amp;amp; Distributed Technology: Systems &amp;amp; Applications, IEEE , vol.4, no.2, pp.63-71, Summer 1996&lt;br /&gt;
*Chandola, V. , [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.5206&amp;amp;rep=rep1&amp;amp;type=pdf &amp;quot;Design Issues in Implementation of Distributed Shared Memory in User Space,&amp;quot;]&lt;br /&gt;
*Nitzberg, B.; Lo, V. , [http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf &amp;quot;Distributed Shared Memory:  A Survey of Issues and Algorithms&amp;quot;]&lt;br /&gt;
*Steven Cameron Woo , Moriyoshi Ohara , Evan Torrie , Jaswinder Pal Singh , Anoop Gupta, [http://dl.acm.org/citation.cfm?id=223990 &amp;quot;The SPLASH-2 programs: characterization and methodological considerations,&amp;quot;] Proceedings of the 22nd annual international symposium on Computer architecture, p.24-36, June 22-24, 1995, S. Margherita Ligure, Italy&lt;br /&gt;
*Jegou, Y. , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Implementation of page management in Mome, a user-level DSM] Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on, p.479-486, 21 May 2003, IRISA/INRIA, France&lt;br /&gt;
*Hennessy, J.; Heinrich, M.; Gupta, A.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=747863 &amp;quot;Cache-Coherent Distributed Shared Memory:  Perspectives on Its Development and Future Challenges,&amp;quot;] Proceedings of the IEEE, Volume: 87 Issue:3, pp.418 - 429, Mar 1999, Comput. Syst. Lab., Stanford Univ., CA&lt;br /&gt;
*J. Protic, M. Tomasevic, and V. Milutinovic, [http://media.wiley.com/product_data/excerpt/76/08186773/0818677376-2.pdf An Overview of Distributed Shared Memory]&lt;br /&gt;
*Yoon, M.; Malek, M.; , [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf &amp;quot;Configurable Shared Virtual Memory for Parallel Computing&amp;quot;] University of Texas Technical Report tr94-21, July 15 1994, Department of Electrical and Computer Engineering, The University of Texas at Austin&lt;br /&gt;
*Dubnicki, C.;   Iftode, L.;   Felten, E.W.;   Kai Li;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=508084 &amp;quot;Software Support for Virtual Memory-Mapped Communication&amp;quot;] Parallel Processing Symposium, 1996., Proceedings of IPPS '96, The 10th International, pp.372 - 381, 15-19 Apr 1996, Dept. of Comput. Sci., Princeton Univ., NJ &lt;br /&gt;
*Dubnicki, C.;   Bilas, A.;   Li, K.;   Philbin, J.;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=580931 &amp;quot;Design and Implementation of Virtual Memory-Mapped Communication on Myrinet&amp;quot;] Parallel Processing Symposium, 1997. Proceedings., 11th International, pp.388 - 396, 1-5 Apr 1997, Princeton Univ., NJ&lt;br /&gt;
*Kranz, D.; Johnson, K.; Agarwal, A.; Kubiatowicz, J.; Lim, B.; , [http://delivery.acm.org.prox.lib.ncsu.edu/10.1145/160000/155338/p54-kranz.pdf?ip=152.1.24.251&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=81831968&amp;amp;CFTOKEN=62928147&amp;amp;__acm__=1327853249_029db7e958cb50bd47056f939f3296f7 &amp;quot;Integrating Message-Passing and Shared-Memory:  Early Experience&amp;quot;] PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp.54 - 63&lt;br /&gt;
*Amza, C.;  Cox, A.L.;  Dwarkadas, S.;  Keleher, P.;  Honghui Lu;  Rajamony, R.;  Weimin Yu;  Zwaenepoel, W.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 &amp;quot;TreadMarks: shared memory computing on networks of workstations&amp;quot;] Computer, Volume: 29 Issue:2, pp. 18 - 28, Feb 1996, Dept. of Comput. Sci., Rice Univ., Houston, TX&lt;br /&gt;
*David R. Cheriton. 1985. [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems.] SIGOPS Oper. Syst. Rev. 19, 4 (October 1985), 26-33.&lt;br /&gt;
*Matthew Chapman and Gernot Heiser. 2009. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html vNUMA: a virtual shared-memory multiprocessor.] In Proceedings of the 2009 conference on USENIX Annual technical conference (USENIX'09). USENIX Association, Berkeley, CA, USA, 2-2.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
The memory hierarchy described for the CSVS system places remote memories:&lt;br /&gt;
# Between main memory and local disk storage&lt;br /&gt;
# Same hierarchy as local disk storage&lt;br /&gt;
# Below local disk storage&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When messages are sent by the OS to retrieve non-local data the virtual address of the retrieved data is translated to physical:&lt;br /&gt;
# At the origin of the message, i.e. where the page fault occurs&lt;br /&gt;
# By the DSM system default manager&lt;br /&gt;
# At the location where the desired page resides&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
DSM nodes&lt;br /&gt;
# partially map variable amounts of their memory to the distributed address space&lt;br /&gt;
# are configured to supply a contiguous and fixed amount of memory to the distributed address space&lt;br /&gt;
# utilize I/O to access the entirely non-local distributed address space&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SAS programming model:&lt;br /&gt;
# Has evolved beyond MP as it is difficult to program in scalable DSM environments&lt;br /&gt;
# Utilizes MP to communicate but relies on the ease of a common address space&lt;br /&gt;
# Has suffered too many security problems, scalable MP now dominates the landscape&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Page management in MOME:&lt;br /&gt;
# Requires consistent address space mapping across all nodes&lt;br /&gt;
# Is managed from a global DSM perspective&lt;br /&gt;
# Allows an F and V page descriptor to occur for the same page on the same node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The most adopted DSM algorithm is:&lt;br /&gt;
# Single Reader/ Single Writer (SRSW)&lt;br /&gt;
# Multiple Readers/ Single Writer (MRSW)&lt;br /&gt;
# Multiple Readers/Multiple Writers (MRMW)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Sequential Consistency:&lt;br /&gt;
# all processors see the same ordering of memory references, and these are issued in sequences by the individual processors&lt;br /&gt;
# the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence&lt;br /&gt;
# consistency is required only on synchronization accesses&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
SPLASH is a&lt;br /&gt;
# coherence protocol&lt;br /&gt;
# collection of parallel programs engineered for the evaluation of shared address space machines&lt;br /&gt;
# DSM implementation&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The complexity of development is &lt;br /&gt;
# the same for MP and SAS&lt;br /&gt;
# lower for SAS&lt;br /&gt;
# lower for MP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
vNUMA is a&lt;br /&gt;
# Fast Fourier Transform implementation&lt;br /&gt;
# network implementation&lt;br /&gt;
# virtual machine that presents a cluster as a virtual shared-memory multiprocessor&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72700</id>
		<title>CSC/ECE 506 Spring 2013/4a ss</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72700"/>
		<updated>2013-02-13T22:13:33Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Dsm.jpg|300px|thumb|left|SCD's IBM SP system blackforest, a distributed shared memory ('''DSM''') system]]&lt;br /&gt;
&lt;br /&gt;
== SAS programming on distributed-memory machines ==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Shared_memory '''Shared Address Space'''] (SAS) programming on distributed memory machines is a programming abstraction that requires less development effort than the traditional method of [http://en.wikipedia.org/wiki/Message_passing '''Message Passing'''] (MP) on distributed memory machines. A typical target for SAS programming is a cluster of servers.  Distributed systems are groups of computers that communicate through a network and share a common work goal.  Distributed systems typically do not physically share the same memory (are not [http://en.wikipedia.org/wiki/Coupling_%28computer_programming%29 '''tightly coupled''']); instead, each processor or group of processors must depend on mechanisms other than direct memory access in order to communicate.  Relevant issues that come to bear include [http://en.wikipedia.org/wiki/Memory_coherence '''memory coherence'''], types of memory access, data and process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''], and performance.&lt;br /&gt;
&lt;br /&gt;
=== Origins ===&lt;br /&gt;
Distributed memory systems began to flourish in the 1980s. Increasing processor performance and network connectivity offered the perfect environment for parallel processing over a network of computers. This was a cheap way to put together massive computing power. The main drawback was the move from sequential programs written for local memory to parallel programs operating on shared memory. This is where SAS provided the means to simplify programming, by hiding the mechanisms used to access remote memory located in other computers of the cluster.&lt;br /&gt;
&lt;br /&gt;
In 1985, Cheriton in his article [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 &amp;quot;Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems&amp;quot;] introduces ideas for the application of shared memory techniques in Distributed memory systems. Cheriton envisioned a system of nodes with a pool of shared memory with a common file namespace that could &amp;quot;decentralize the implementation of a service.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:Dsmd.jpg|400px|thumb|right|Distributed Shared Memory]]&lt;br /&gt;
Early distributed computer systems relied almost exclusively on message passing in order to communicate with one another, and this technique is still widely used today.  In a message passing model, each processor's local memory can be considered as isolated from that of the rest of the system.  Processes or objects can send or receive messages in order to communicate and this can occur in a synchronous or asynchronous manner.  In distributed systems, and particularly with certain types of programs, the message passing model can become overly burdensome to the programmer as tracking data movement and maintaining data integrity can become quite challenging with many control threads.  A shared address or shared-memory system, however, can provide a programming model that simplifies data sharing via uniform mechanisms of data structure reads and writes on common memory.  Current distributed systems seek to take advantage of both SAS and MP programming model principles in hybrid systems. &lt;br /&gt;
&lt;br /&gt;
=== Distributed Shared Memory (DSM) ===&lt;br /&gt;
Distributed shared memory is an architectural approach designed to overcome the scaling limitations of symmetric shared memory multiprocessors while retaining a shared memory model for communication and programming. Generally a distributed system consists of a set of nodes connected by an interconnection network.  Nodes may consist of individual processors or a multiprocessor system (e.g., a [http://en.wikipedia.org/wiki/Symmetric_multiprocessing '''Symmetric Multiprocessor'''] (SMP)), the latter typically sharing a system bus.  Each node itself contains a local memory, which maps partially to the distributed address space.  &lt;br /&gt;
Regardless of the system topology (bus, ring, mesh), a specific interconnection in each cluster must connect it to the system. Information about states and current locations of particular data blocks usually resides in a system table or directory. Directory organization varies from full map storage to different dynamic organizations.&lt;br /&gt;
Three issues arise when accessing data in the DSM address space while keeping the data consistent:&lt;br /&gt;
a) Which DSM algorithm to use to access data. Commonly used strategies are replication and migration. Replication allows multiple copies of the same data item to reside in different local memories. Migration implies that only a single copy of a data item exists at any one time, so the data item must be moved to the requesting site for exclusive use.  &lt;br /&gt;
b) Implementation level of the DSM. A lookup should first determine whether the requested data is in local memory; if not, the system must bring the data into local memory. This can be done in software, in hardware, or in both. The choice of implementation depends on price/performance trade-offs.&lt;br /&gt;
c) Memory consistency model. The behavior of memory with respect to read and write operations from multiple processors must be governed by an appropriate memory consistency model.&lt;br /&gt;
&lt;br /&gt;
=== Cache-Coherent DSM ===&lt;br /&gt;
&lt;br /&gt;
Cache coherence problems arise when different processors cache and update values of the same memory location. Early DSM systems implemented a shared address space where the amount of time required to access a piece of data was &lt;br /&gt;
related to its location.  These systems became known as [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access '''Non-Uniform Memory Access'''] (NUMA), whereas an SMP type&lt;br /&gt;
system is known as a [http://en.wikipedia.org/wiki/Uniform_Memory_Access '''Uniform Memory Access'''] (UMA) architecture.  NUMA architectures were difficult to program due &lt;br /&gt;
to potentially significant differences in data access times. SMP architectures dealt with this problem through caching. &lt;br /&gt;
Protocols were established that ensured that, prior to writing a location, all other copies of the location (in other caches)&lt;br /&gt;
were invalidated. Thus, the system allows multiple copies of a memory location to exist when&lt;br /&gt;
it is being read, but only one copy when it is being written. &lt;br /&gt;
&lt;br /&gt;
Cache-coherent DSM architectures rely on directory-based [http://en.wikipedia.org/wiki/Cache_coherency '''cache coherence'''], where an extra directory structure keeps track&lt;br /&gt;
of all blocks that have been cached by each processor. Since the directory tracks which caches have copies&lt;br /&gt;
of any given memory block, a coherence protocol can use&lt;br /&gt;
the directory, together with state and other information about each cached block, to maintain a consistent view of &lt;br /&gt;
memory. A simple cache coherence&lt;br /&gt;
protocol can operate with three states for each cache block: Invalid,&lt;br /&gt;
Shared, and Exclusive.  Furthermore, in a cache-coherent DSM machine, the directory is distributed so that each entry is associated&lt;br /&gt;
with the block it describes in the local physical memory.&lt;br /&gt;
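&lt;br /&gt;
The three-state, write-invalidate behaviour described above can be sketched for a single memory block as follows. This is a minimal Python illustration under simplifying assumptions (one directory entry, no data transfer, no message latency); the class ''Directory'' and its methods ''read'' and ''write'' are hypothetical names, not part of any specific DSM system.&lt;br /&gt;

```python
INVALID, SHARED, EXCLUSIVE = "Invalid", "Shared", "Exclusive"


class Directory:
    """Tracks, for one memory block, the state of each processor's cached copy."""

    def __init__(self, num_processors):
        self.state = [INVALID] * num_processors

    def read(self, proc):
        """Processor proc reads the block."""
        if self.state[proc] == EXCLUSIVE:
            return  # read hit on the exclusive copy, nothing changes
        # Any exclusive owner is downgraded so that copies can be shared.
        for other, current in enumerate(self.state):
            if current == EXCLUSIVE:
                self.state[other] = SHARED
        self.state[proc] = SHARED

    def write(self, proc):
        """Processor proc writes the block."""
        # Every other copy is invalidated before exclusivity is granted.
        for other in range(len(self.state)):
            if other != proc:
                self.state[other] = INVALID
        self.state[proc] = EXCLUSIVE
```

After any sequence of calls, at most one processor holds the block in the Exclusive state, which is the single-writer/multiple-reader property the protocol is meant to guarantee.&lt;br /&gt;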
&lt;br /&gt;
=== Page Management and memory mapping in Mome ===&lt;br /&gt;
[[File:Untitled_Project.jpg|350px|thumb|left|Memory Mapping in Mome]]&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Mome] is described by its developers as a user-level distributed shared memory.  Mome, in 2003, was a run-time model that mapped Mome segments onto node private address space.  &lt;br /&gt;
&lt;br /&gt;
===== Mome Segment creation =====&lt;br /&gt;
&lt;br /&gt;
Segment creation was initiated through a ''MomeCreateSegment(size)'' call, which returned an identifier for mapping used by all nodes.  Any process can request a mapping of a section of its local memory to a Mome segment section by calling ''MomeMap(Addr, Lg, Prot, Flags, Seg, Offset)'', which returns the starting address of the mapped region.  Each mapping request made by a process is independent and the addresses of the mappings may or may not be consistent on all nodes.  If mappings are consistent between processes, however, then pointers may be shared by them.  Mome supports strong and weak consistency models, and for any particular page each node is able to dynamically manage its consistency during program execution.&lt;br /&gt;
&lt;br /&gt;
===== Page Management in Mome =====&lt;br /&gt;
&lt;br /&gt;
Mome manages [http://en.wikipedia.org/wiki/Page_%28computer_memory%29 '''pages'''] in a directory-based scheme where each page directory maintains the status of six characteristics per page on each node.  The page manager acts upon collections of nodes according to these characteristics for each page:  &lt;br /&gt;
V nodes possess the current version, M nodes have a modified version, S nodes want strong consistency, I nodes are &lt;br /&gt;
invalidated, F nodes have initiated a modification merge, and H nodes are a special type of hidden page.  A new version of &lt;br /&gt;
a page is created prior to a constraint violation and before modifications are integrated as a result of a consistency&lt;br /&gt;
request.  &lt;br /&gt;
&lt;br /&gt;
===== Memory mapping in Mome =====&lt;br /&gt;
&lt;br /&gt;
The Mome memory mapping figure to the left shows a possible DSM memory organization on a single node.  The DSM memory size&lt;br /&gt;
shown is 22 pages.  When a new segment is created on a node a segment descriptor is created on that node.  In this case the&lt;br /&gt;
segment descriptor is 12 pages, with each segment descriptor block corresponding to one page.  Each block also contains&lt;br /&gt;
three DSM memory references for current, modified and next version of pages.  The memory organization state shows an &lt;br /&gt;
application with two mappings, M1 and M2, with segment offsets at 0 and 8.  The six pages of M1 are managed by segment &lt;br /&gt;
descriptor blocks 0 to 5.  The descriptor blocks (and application memory) show that pages 1,2 and 5 have no associated &lt;br /&gt;
memory, while M1 page 0 is mapped to block 6 as a current version and M1 page 3 is mapped to block 13 as a current version,&lt;br /&gt;
block 8 as a modified version, and has initiated a modifications merge as indicated by the block 17 pointer.  The communication&lt;br /&gt;
layer manages incoming messages from other nodes. &lt;br /&gt;
&lt;br /&gt;
[[File:Mem hierarchy.png|200px|thumb|right|Memory hierarchy of node]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Node Communication ===&lt;br /&gt;
As described by [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf Yoon, et al.] in 1994, the communication paradigm &lt;br /&gt;
for their DSM node network relies on a memory hierarchy for each node that places remote memories at the same level of the hierarchy as &lt;br /&gt;
its own local disk storage.  [http://en.wikipedia.org/wiki/Page_fault '''Page faults'''] within a given node that can be resolved within disk storage are handled &lt;br /&gt;
normally while those that cannot are resolved between node main memory and memory of other nodes. Point to point &lt;br /&gt;
communication at the node level is supported through message passing, and the specific mechanism for communication is&lt;br /&gt;
agreed to by all nodes. &lt;br /&gt;
&lt;br /&gt;
Yoon describes a DSM system that generates a shared virtual memory on a per-job basis.  A '''configurable shared virtual address space'''&lt;br /&gt;
(CSVS) is prepared when a member node receives a job, generates a job identification number, and creates&lt;br /&gt;
an information table in its memory:&lt;br /&gt;
&lt;br /&gt;
                                     ''JOB_INFORMATION {''&lt;br /&gt;
                                         ''status;''&lt;br /&gt;
                                         ''number_of_tasks;''&lt;br /&gt;
                                         ''number_of_completed_tasks;''&lt;br /&gt;
                                         ''*member_list;''                    /*pointer to first member*/&lt;br /&gt;
                                         ''number_of_members;''&lt;br /&gt;
                                         ''IO_server;''&lt;br /&gt;
                                     ''}''&lt;br /&gt;
&lt;br /&gt;
The ''status'' field refers to the creation state of the CSVS, and ''number_of_members'' and ''member_list'' are established through&lt;br /&gt;
a task distribution process during address space assignment.  All tasks associated with the program are tagged with&lt;br /&gt;
the ''job_id'' and ''requester_id'' and, following address space assignment, are distributed across the system.  The&lt;br /&gt;
actual CSVS creation occurs when the first task of a job is initiated by a member, which asks all other members to generate the new&lt;br /&gt;
CSVS.  Subspace assignment for the shared address space (SAS) model then proceeds under the specific ''job_id''.&lt;br /&gt;
&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Operating_system '''operating system'''] (OS) or [http://en.wikipedia.org/wiki/Memory_management_unit '''memory management unit'''] (MMU) of each member maintains a copy of the ''JOB_INFORMATION'' &lt;br /&gt;
table which is consulted to identify the default manager when a page fault occurs.  When a page fault does occur, the MMU&lt;br /&gt;
locates the default manager and handles the fault normally.  If the page requested is out of its subspace then the &lt;br /&gt;
virtual address, ''job_id'', and default manager identification are sent to the [http://en.wikipedia.org/wiki/Control_unit '''control unit'''] (CU) to construct a &lt;br /&gt;
message requesting a page copy.  All messages sent through the CSVS must include a virtual address and the ''job_id'',&lt;br /&gt;
which acts as protection to control access to relevant memory locations.  When received at the appropriate member&lt;br /&gt;
node, the virtual address is translated to a local physical address.&lt;br /&gt;
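&lt;br /&gt;
The fault-handling path described above can be sketched roughly as follows.  This is a minimal illustration, not Yoon and Malek's implementation:&lt;br /&gt;
the names ''JobInformation'', ''PageRequest'' and ''handle_page_fault'', the page size, and the rule that picks the default manager&lt;br /&gt;
(page number modulo the number of members) are all assumptions made for the example.&lt;br /&gt;

```python
from dataclasses import dataclass, field
from typing import List, Optional

PAGE_SIZE = 4096  # assumed page size; the source does not specify one


@dataclass
class JobInformation:
    """Per-job table mirrored on every member (loosely follows JOB_INFORMATION)."""
    job_id: int
    member_list: List[int] = field(default_factory=list)

    def default_manager(self, virtual_address: int) -> int:
        # Hypothetical rule: spread pages round-robin over the members.
        page_number = virtual_address // PAGE_SIZE
        return self.member_list[page_number % len(self.member_list)]


@dataclass
class PageRequest:
    """Message the control unit would build; it always carries the job_id."""
    job_id: int
    virtual_address: int
    manager_id: int


def handle_page_fault(job: JobInformation, node_id: int,
                      virtual_address: int) -> Optional[PageRequest]:
    """Return None for a locally handled fault, else a page-copy request."""
    manager = job.default_manager(virtual_address)
    if manager == node_id:
        return None  # page lies in this node's own subspace
    return PageRequest(job.job_id, virtual_address, manager)


job = JobInformation(job_id=7, member_list=[0, 1, 2])
local = handle_page_fault(job, node_id=0, virtual_address=0)
remote = handle_page_fault(job, node_id=0, virtual_address=PAGE_SIZE)
```

Here ''local'' is ''None'' because node 0 manages page 0 itself, while ''remote'' is a request addressed to node 1 that carries both the virtual address and the ''job_id''.&lt;br /&gt;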
[[File:Jacobi_code.jpg|300px|thumb|left|Jacobi method pseudocode using TreadMarks API]]&lt;br /&gt;
&lt;br /&gt;
=====Improvements in communication=====&lt;br /&gt;
Early SAS programming models in DSM environments suffered from poor performance because protection schemes required&lt;br /&gt;
applications to access the network via system calls, significantly increasing latency.  Later software&lt;br /&gt;
systems and network interfaces were able to ensure safety without incurring the time cost of system calls.  Addressing this and other &lt;br /&gt;
latency sources on both ends of communication was an important goal for projects such as the &lt;br /&gt;
'''virtual memory-mapped communication''' (VMMC) model that was developed as part of the [http://shrimp.cs.princeton.edu/index.html Shrimp Project]. &lt;br /&gt;
&lt;br /&gt;
Protection is achieved in VMMC because the receiver must grant permission before the sender is allowed to transfer data&lt;br /&gt;
to a receiver-defined area of the receiver's address space.  In this communication scheme, the receiver process exports areas of its&lt;br /&gt;
address space that will act as receive buffers, and sending processes must import them as destinations.  There is no explicit &lt;br /&gt;
receive operation in VMMC.  Receivers define&lt;br /&gt;
which senders can import specific buffers, and VMMC ensures that only receive buffer space is overwritten.  Imported&lt;br /&gt;
receive buffers are mapped to a destination proxy space, which can be implemented as part of the sender's virtual address&lt;br /&gt;
space and can be translated by VMMC to a receiver, process, and memory address.  VMMC supports a deliberate update&lt;br /&gt;
request, which explicitly transfers data from the sender's memory to a previously imported receive buffer.  This transfer occurs directly, without interrupting the receiver's&lt;br /&gt;
CPU.&lt;br /&gt;
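&lt;br /&gt;
The export/import handshake can be illustrated with a small in-process model.  This is only a sketch of the permission idea:&lt;br /&gt;
the classes ''Receiver'', ''Sender'' and ''ReceiveBuffer'' and their methods are hypothetical names, not the VMMC API, and a real transfer&lt;br /&gt;
would go through the network interface rather than a shared Python object.&lt;br /&gt;

```python
class ReceiveBuffer:
    """A region of the receiver's address space exported for incoming data."""

    def __init__(self, size, allowed_senders):
        self.data = bytearray(size)
        self.allowed_senders = set(allowed_senders)


class Receiver:
    def __init__(self):
        self.exports = {}

    def export(self, name, size, allowed_senders):
        # The receiver decides which senders may import this buffer.
        self.exports[name] = ReceiveBuffer(size, allowed_senders)


class Sender:
    def __init__(self, sender_id):
        self.sender_id = sender_id
        self.proxies = {}  # stands in for the destination proxy space

    def import_buffer(self, receiver, name):
        buffer = receiver.exports[name]
        if self.sender_id not in buffer.allowed_senders:
            raise PermissionError("receiver has not granted access")
        self.proxies[name] = buffer

    def deliberate_update(self, name, offset, payload):
        # Data lands directly in the receive buffer; there is no receive call.
        buffer = self.proxies[name]
        if offset + len(payload) > len(buffer.data):
            raise ValueError("transfer would write outside the receive buffer")
        buffer.data[offset:offset + len(payload)] = payload


receiver = Receiver()
receiver.export("inbox", size=16, allowed_senders={"A"})

sender_a = Sender("A")
sender_a.import_buffer(receiver, "inbox")
sender_a.deliberate_update("inbox", 0, b"hello")

sender_b = Sender("B")
try:
    sender_b.import_buffer(receiver, "inbox")
    b_imported = True
except PermissionError:
    b_imported = False
```

Sender A, which the receiver listed when exporting, can write into the buffer; sender B is refused at import time, before any data moves.&lt;br /&gt;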
&lt;br /&gt;
[[File:Shortest_path_pseudocode.jpg|300px|thumb|right|Shortest path pseudocode using TreadMarks API]]&lt;br /&gt;
=== Programming Environment ===&lt;br /&gt;
The globally shared memory abstraction provided through virtual memory or some other DSM mechanism allows programmers &lt;br /&gt;
to focus on algorithms instead of processor communication and data tracking.  Many programming environments have been&lt;br /&gt;
developed for DSM systems including Rice University's [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]&lt;br /&gt;
in the 1990s.  TreadMarks was a user-level library that ran on top of Unix.  Programs were written in&lt;br /&gt;
C, C++ or Fortran and then compiled and linked with the TreadMarks library.  &lt;br /&gt;
&lt;br /&gt;
Shown at left is a pseudocode example of using the TreadMarks API to implement the Jacobi method, a type of partial &lt;br /&gt;
differential equation solver.  The code iterates over a 2D array and updates each element to the average of its four&lt;br /&gt;
nearest neighbors.  All processors are assigned an approximately equivalent number of rows and neighboring processes &lt;br /&gt;
share boundary rows as necessary for the calculation.  This example shows TreadMarks' use of [http://en.wikipedia.org/wiki/Barrier_%28computer_science%29 '''barriers'''], a technique used for process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''].  Barriers prevent race&lt;br /&gt;
conditions.  ''void Tmk_startup(int argc, char **argv)'' initializes TreadMarks and starts the remote processes.  &lt;br /&gt;
The ''void Tmk_barrier(unsigned id)'' call blocks the calling process until every other process arrives at the barrier.  In this&lt;br /&gt;
example, ''Tmk_barrier(0)'' guarantees that process 0 completes initialization before any process proceeds, ''Tmk_barrier(1)'' &lt;br /&gt;
guarantees all previous iteration values are read before any current iteration values are written, and ''Tmk_barrier(2)''&lt;br /&gt;
guarantees all current iteration values are written before any next iteration computation begins.&lt;br /&gt;
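&lt;br /&gt;
The same read-barrier-write-barrier structure can be tried out locally.  The sketch below is an analogy only: it uses Python threads and&lt;br /&gt;
''threading.Barrier'' in place of TreadMarks processes and ''Tmk_barrier'', and the grid size, worker count and iteration count are&lt;br /&gt;
arbitrary choices for the example.  Initialization happens before the threads start, so no counterpart of ''Tmk_barrier(0)'' is needed.&lt;br /&gt;

```python
import threading

ROWS, COLS, WORKERS, ITERATIONS = 6, 6, 2, 10

# Fixed boundary of 1.0 around an interior that starts at 0.0.
grid = [[1.0 if r in (0, ROWS - 1) or c in (0, COLS - 1) else 0.0
         for c in range(COLS)] for r in range(ROWS)]
scratch = [row[:] for row in grid]
barrier = threading.Barrier(WORKERS)


def worker(worker_id):
    # Each worker owns an equal band of interior rows.
    per_worker = (ROWS - 2) // WORKERS
    first = 1 + worker_id * per_worker
    last = first + per_worker  # exclusive
    for _ in range(ITERATIONS):
        # Read phase: average the four nearest neighbours into scratch.
        for r in range(first, last):
            for c in range(1, COLS - 1):
                scratch[r][c] = (grid[r - 1][c] + grid[r + 1][c] +
                                 grid[r][c - 1] + grid[r][c + 1]) / 4.0
        barrier.wait()  # all old values are read before any is overwritten
        # Write phase: publish this worker's rows.
        for r in range(first, last):
            for c in range(1, COLS - 1):
                grid[r][c] = scratch[r][c]
        barrier.wait()  # all new values are written before the next iteration


threads = [threading.Thread(target=worker, args=(i,)) for i in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Removing either ''barrier.wait()'' call would let one worker read a boundary row that its neighbor is in the middle of overwriting.&lt;br /&gt;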
&lt;br /&gt;
To the right is shown a short pseudocode program exemplifying another SAS synchronization technique which uses [http://en.wikipedia.org/wiki/Lock_%28computer_science%29 '''locks'''].  This program calculates the shortest path in a grouping of nodes that starts at any designated start node, visits each&lt;br /&gt;
other node once and returns to the origin node.  The shortest route identified thus far is stored in the shared ''Shortest_length''&lt;br /&gt;
and investigated routes are kept in a queue, most promising at the front, and expanded one node at a time.  A process&lt;br /&gt;
compares its resulting shortest partial path with ''Shortest_length'', updating if necessary and returns to the queue&lt;br /&gt;
to continue its search.  Process 0 allocates the shared queue and minimum length.  Exclusive access must be established&lt;br /&gt;
and maintained to ensure correctness and this is achieved through a lock on the queue and ''Shortest_length''.  Each&lt;br /&gt;
process acquires the queue lock to identify a promising partial path and releases it upon finding one.  When &lt;br /&gt;
updating ''Shortest_length'', a lock is acquired to ensure [http://en.wikipedia.org/wiki/Mutual_exclusion '''mutual exclusion'''] on this shared data as well.&lt;br /&gt;
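&lt;br /&gt;
The locking discipline can be sketched with Python threads.  This is a simplified stand-in, not the TreadMarks program: it enumerates complete&lt;br /&gt;
tours instead of expanding partial paths from a priority queue, the distance matrix is made up, and ''queue_lock'' and ''length_lock'' are&lt;br /&gt;
hypothetical names for the two locks.&lt;br /&gt;

```python
import itertools
import threading

# Symmetric distance matrix for four cities (made-up values).
DIST = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]

# Shared state: a work queue of candidate tours and the best length so far.
queue = [(0,) + p + (0,) for p in itertools.permutations(range(1, 4))]
shortest_length = float("inf")
queue_lock = threading.Lock()
length_lock = threading.Lock()


def tour_length(tour):
    return sum(DIST[a][b] for a, b in zip(tour, tour[1:]))


def worker():
    global shortest_length
    while True:
        with queue_lock:  # exclusive access while taking work
            if not queue:
                return
            tour = queue.pop()
        length = tour_length(tour)  # computed outside any lock
        with length_lock:  # exclusive access while updating the minimum
            if shortest_length > length:
                shortest_length = length


threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Holding each lock only for the short queue or minimum update keeps the expensive length computation parallel.&lt;br /&gt;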
&lt;br /&gt;
=== DSM Implementations ===&lt;br /&gt;
From an architectural point of view, DSMs are composed of several nodes connected via a network. Each node can be an individual machine or a cluster of machines. Each node has local memory modules that are partly or entirely part of the shared memory. DSM implementations can be classified by many characteristics. The most obvious one is the level at which the DSM is implemented: software, hardware, or hybrid. This historical classification is taken from [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 Distributed shared memory: concepts and systems].&lt;br /&gt;
&lt;br /&gt;
Software DSM implementations are those implemented through user-level software, the OS, a programming language, or a combination of some or all of these. &lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Implementation'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Implementation / Cluster configuration'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Network'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Algorithm'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Consistency Model'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Granularity Unit'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Coherence Policy'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''SW/HW/Hybrid'''&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/vm-and-gc/ivy-shared-virtual-memory-li-icpp-1988.pdf IVY]||User-level library + OS modification || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||1 Kbyte ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/121133.121159 Munin]||Runtime system + linker + library + preprocessor + OS modifications ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||Type-specific (SRSW, MRSW, MRMW) ||Release ||Variable size objects ||Type-specific (delayed update, invalidate) ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]||User-level || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRMW ||Lazy release ||4 Kbytes ||Update, Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/74851.74871 Mirage]||OS kernel ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||MRSW ||Sequential ||512 bytes ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://onlinelibrary.wiley.com.prox.lib.ncsu.edu/doi/10.1002/spe.4380210503/pdf Clouds]||OS, out of kernel || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Inconsistent, sequential ||8 Kbytes ||Discard segment when unlocked ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1663305&amp;amp;tag=1 Linda]||Language || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||Variable (tuple size) ||Implementation- dependent ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=93183 Memnet]||Single processor, Memnet device||Token ring||MRSW||Sequential||32 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Scalable_Coherent_Interface SCI]||Arbitrary||Arbitrary||MRSW||Sequential||16 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.8026&amp;amp;rep=rep1&amp;amp;type=pdf KSR1]||64-bit custom PE, I+D caches, 32M local memory||Ring-based||MRSW||Sequential||128 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=766965&amp;amp;isnumber=16621 RMS]||1-4 processors, caches, 256M local memory||RM bus||MRMW||Processor||4 bytes||Update||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Alewife_(multiprocessor) Alewife]||Sparcle PE, 64K cache, 4M local memory, CMMU|| mesh||MRSW||Sequential||16 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://dl.acm.org/citation.cfm?id=192056 Flash]||MIPS T5, I+D caches, Magic controller|| mesh||MRSW||Release||128 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s08/papers/10_tempest.pdf Typhoon]||SuperSparc, 2-L caches|| NP controller||MRSW||Custom||32 bytes||Invalidate custom||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://shrimp.cs.princeton.edu/ Shrimp]||16 Pentium PC nodes|| Intel Paragon routing network||MRMW||AURC, scope||4 Kbytes||Update/Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Below is an explanation of the main characteristics listed in the DSM classification.&lt;br /&gt;
 &lt;br /&gt;
There are three types of DSM algorithm: &lt;br /&gt;
* '''Single Reader/Single Writer''' (SRSW) &lt;br /&gt;
** central server algorithm - produces long network delays &lt;br /&gt;
** migration algorithm - produces thrashing and false sharing&lt;br /&gt;
* '''Multiple Readers/Single Writer''' (MRSW) - read replication algorithm. It uses write invalidate. MRSW is the most widely adopted algorithm.&lt;br /&gt;
* '''Multiple Readers/Multiple Writers''' (MRMW) - full replication algorithm.  It has full concurrency and uses atomic updates.&lt;br /&gt;
&lt;br /&gt;
The consistency model plays a fundamental role in DSM systems. Because of the distributed nature of these systems, each consistency model constrains memory accesses in a different way. &amp;quot;A memory consistency model defines the legal ordering of memory references issued by some processor, as observed by other processors.&amp;quot; The stricter the consistency model, the higher the access times, but the simpler the programming. Some consistency model types are:&lt;br /&gt;
* Sequential consistency - all processors see the same ordering of memory references, and each processor's references appear in the order in which that processor issued them.&lt;br /&gt;
* Processor consistency - the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence. &lt;br /&gt;
* Weak consistency - consistency is required only on synchronization accesses.&lt;br /&gt;
* Release consistency - divides the synchronization accesses into acquire and release. Normal reads and writes for a particular node can be done only after all acquires for that node are finished. Similarly, releases can only be done after all writes and reads are finished. &lt;br /&gt;
* Lazy Release consistency - extends the Release consistency, by propagating modifications to the shared data only on the acquire, and of those, only the ones related to critical sections.&lt;br /&gt;
* Entry consistency - synchronization is performed at variable level. This increases programming labor but helps with lowering latency and traffic exchanges as only the specific variable needs to be synchronized.&lt;br /&gt;
&lt;br /&gt;
Granularity refers to the size of the data blocks managed by the coherence protocol. The unit differs between hardware and software systems: hardware systems tend to use smaller blocks than the virtual memory layer that manages the data in software systems. The problem with larger blocks is that the probability of contention is higher, even when the processors involved are not accessing exactly the same piece of memory, only different parts of the same block. This is known as false sharing, and it causes thrashing (memory blocks keep being requested by processors, and processors keep waiting for the same memory blocks).&lt;br /&gt;
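&lt;br /&gt;
A small calculation makes the effect of block size concrete.  The addresses and block sizes below are illustrative assumptions&lt;br /&gt;
(a 4 Kbyte page and a 128 byte block), and ''block_of'' and ''falsely_shared'' are helper names introduced only for this example.&lt;br /&gt;

```python
def block_of(address, block_size):
    """Index of the coherence block that contains a byte address."""
    return address // block_size


def falsely_shared(addr_a, addr_b, block_size):
    """True when two distinct addresses land in the same coherence block."""
    same_block = block_of(addr_a, block_size) == block_of(addr_b, block_size)
    return addr_a != addr_b and same_block


# Two variables 256 bytes apart, each written by a different processor.
x_addr, y_addr = 0x1000, 0x1100

page_grain = falsely_shared(x_addr, y_addr, block_size=4096)  # page-sized unit
line_grain = falsely_shared(x_addr, y_addr, block_size=128)   # small block
```

With page-sized blocks the two unrelated variables share one coherence unit, so every write by one processor disturbs the other; with 128 byte blocks they do not.&lt;br /&gt;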
&lt;br /&gt;
Coherence policy regulates data replication. It dictates whether data written at one site should be invalidated or updated at the remote sites. Usually, systems with fine-grain coherence (byte/word) use the update policy, whereas systems based on coarse-grain (page) coherence use the invalidate policy. Other parts of the literature call this the coherence protocol, and the two types are known as write-invalidate and write-update. The write-invalidate protocol invalidates all copies except one before writing to it. In contrast, write-update keeps all copies up to date.&lt;br /&gt;
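&lt;br /&gt;
The difference between the two policies can be shown with a toy model of one replicated block.  This is a sketch under simplifying assumptions:&lt;br /&gt;
the ''Block'' class is hypothetical, an invalid copy is represented by ''None'', and message passing between nodes is replaced by direct dictionary updates.&lt;br /&gt;

```python
class Block:
    """Copies of one shared block, keyed by node id (None marks an invalid copy)."""

    def __init__(self, nodes, value=0):
        self.copies = {node: value for node in nodes}

    def write_invalidate(self, writer, value):
        # Every copy except the writer's is invalidated before the write.
        for node in self.copies:
            self.copies[node] = None
        self.copies[writer] = value

    def write_update(self, writer, value):
        # The new value is propagated to every node that holds a copy.
        for node in self.copies:
            self.copies[node] = value

    def read(self, node, owner):
        # A node with an invalid copy re-fetches from the current owner.
        if self.copies[node] is None:
            self.copies[node] = self.copies[owner]
        return self.copies[node]


inv = Block(nodes=["A", "B", "C"])
inv.write_invalidate("A", 42)
after_invalidate = dict(inv.copies)
refetched = inv.read("B", owner="A")

upd = Block(nodes=["A", "B", "C"])
upd.write_update("A", 42)
after_update = dict(upd.copies)
```

Write-invalidate defers the cost to the next reader, which must fetch the block again; write-update pays on every write by touching all copies.&lt;br /&gt;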
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
There are numerous studies of the performance of shared memory applications in distributed systems. The vast majority of them use a collection of programs named [http://dl.acm.org/citation.cfm?id=223990 SPLASH and SPLASH-2.]&lt;br /&gt;
===== SPLASH and SPLASH-2 =====&lt;br /&gt;
The '''Stanford ParalleL Applications for Shared memory''' (SPLASH) is a collection of parallel programs engineered for the evaluation of shared address space machines. Research studies have used these programs to measure and analyze different aspects of the DSM architectures that were emerging at the time. A subsequent suite of programs (SPLASH-2) was developed to address the limitations of the SPLASH programs. SPLASH-2 covers a broader domain of scientific programs, makes use of improved algorithms, and pays more attention to the architecture of the underlying systems.&lt;br /&gt;
&lt;br /&gt;
Selected applications in the SPLASH-2 collections include:&lt;br /&gt;
*FFT: a '''Fast Fourier Transform''' implementation, in which the data is organized in source and destination matrices so that each processor stores a contiguous set of rows in its local memory. In this application all processors communicate with one another, sending data to each other, to perform a matrix transposition.&lt;br /&gt;
*Ocean: calculations of large-scale ocean movements simulating eddy currents. Its calculations show nearest-neighbor access patterns on a multi-grid formation rather than on a single grid.&lt;br /&gt;
*LU: decomposition of a matrix into the product of a lower triangular and an upper triangular matrix. LU exhibits a &amp;quot;one-to-many non-personalized communication&amp;quot;.&lt;br /&gt;
*Barnes: simulates the interaction of a group of particles over time steps. &lt;br /&gt;
*Radix sort: integer sorting algorithm. This implementation involves communication among all the processors, and the communication follows irregular patterns.&lt;br /&gt;
&lt;br /&gt;
===== Case studies =====&lt;br /&gt;
In 2001, [http://escholarship.org/uc/item/76p9b40g#page-1 Shan et al.] presented a comparison of the performance and programming effort of message passing (MP) versus SAS running on clusters of '''symmetric multiprocessors''' (SMPs). They highlighted the &amp;quot;automatic management and coherent replication&amp;quot; of the SAS programming model, which facilitates programming on these types of clusters. The study uses MPI/Pro for the MP programming model and the GeNIMA SVM protocol (a page-based shared virtual memory protocol) for SAS on a 32-processor system (a cluster of 8 machines, each a 4-way SMP). The applications used include regularly structured ones such as FFT, Ocean, and LU, and irregular ones such as Radix sort.&lt;br /&gt;
&lt;br /&gt;
The complexity of development is represented by the number of lines of code per application, shown in the table below. SAS complexity is significantly lower than that of MP, and the difference grows as applications become more irregular and dynamic in nature (the MPI version of Radix is almost twice as long as the SAS version).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Appl.&lt;br /&gt;
!FFT &lt;br /&gt;
!OCEAN &lt;br /&gt;
!LU &lt;br /&gt;
!RADIX &lt;br /&gt;
!SAMPLE &lt;br /&gt;
!N-BODY&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MPI ||222||4320||470||384||479||1371&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SAS ||210 ||2878 ||309||201||450||950&lt;br /&gt;
|}&lt;br /&gt;
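&lt;br /&gt;
The relative code size can be computed directly from the table.  The snippet below only restates the published line counts;&lt;br /&gt;
the variable names are chosen for the example.&lt;br /&gt;

```python
# Lines of code per application, copied from the table above.
apps = ["FFT", "OCEAN", "LU", "RADIX", "SAMPLE", "N-BODY"]
mpi_lines = [222, 4320, 470, 384, 479, 1371]
sas_lines = [210, 2878, 309, 201, 450, 950]

# Ratio of MPI to SAS code size; values above 1 mean the MPI version is longer.
ratios = {app: round(m / s, 2) for app, m, s in zip(apps, mpi_lines, sas_lines)}

for app, ratio in ratios.items():
    print(f"{app:7s} {ratio:.2f}")
```

The MPI version is longer for every application, with the largest gap for the irregular Radix sort.&lt;br /&gt;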
&lt;br /&gt;
In terms of performance, the results indicated that SAS was only about half as efficient as MP at exploiting parallelism for most of the applications in this study. The only application that showed similar performance for both methods (MP and SAS) was LU. The authors concluded that the overhead of the SVM protocol for maintaining page coherence and synchronization was the handicap of the easier SAS programming model.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2004, [http://dl.acm.org/citation.cfm?id=1006252 Iosevich and Schuster] performed a study on the choice of memory consistency model in a DSM. The two types of models under study were the '''sequential consistency''' (SC) model and the relaxed consistency model, in particular the '''home-based lazy release consistency''' (HLRC) protocol. The SC provides a less complex programming model, whereas the HLRC improves on running performance as it allows parallel memory access. Memory consistency models provide a specific set of rules for interfacing with memory.&lt;br /&gt;
&lt;br /&gt;
The authors used a [http://www.cs.uga.edu/~dkl/6730/Fall02/Readings/millipage.pdf multiview] technique to ensure an efficient implementation of SC with fine-grain access to memory. The main advantage of this technique is that by mapping one physical region to several virtual regions, the system avoids fetching the whole content of the physical memory when there is a fault accessing a specific variable located in one of the virtual regions. A further step is the mechanism proposed by [http://onlinelibrary.wiley.com/doi/10.1002/spe.417/abstract Niv and Schuster] to change the granularity dynamically at run time.&lt;br /&gt;
For SC with multiview (MV), only the page replicas need to be tracked, whereas for HLRC the tracking needed is much more complex. The advantage of HLRC is that write faults on read-only pages are handled locally, so these operations cost less.&lt;br /&gt;
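&lt;br /&gt;
The core idea of mapping one physical region into several virtual regions can be demonstrated with memory-mapped files.  This is only an analogy&lt;br /&gt;
under stated assumptions: a temporary file stands in for the physical page, Python's ''mmap'' module stands in for the DSM's mapping layer, and the&lt;br /&gt;
two views are given different access rights to suggest how each view can be protected independently.  It is not the multiview implementation itself.&lt;br /&gt;

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE

# One backing region, standing in for a physical page that holds shared data.
backing = tempfile.TemporaryFile()
backing.write(bytes(PAGE))
backing.flush()

# Two views of the same bytes; each view carries its own access rights.
writable_view = mmap.mmap(backing.fileno(), PAGE, access=mmap.ACCESS_WRITE)
readonly_view = mmap.mmap(backing.fileno(), PAGE, access=mmap.ACCESS_READ)

# A write through the first view is visible through the second one.
writable_view[0:4] = (1234).to_bytes(4, "little")
seen_through_other_view = int.from_bytes(readonly_view[0:4], "little")

# The second view refuses writes, independently of the first.
try:
    readonly_view[0:4] = bytes(4)
    write_allowed = True
except TypeError:
    write_allowed = False

writable_view.close()
readonly_view.close()
backing.close()
```

Because protection is attached to the view rather than to the underlying page, a fault on one view need not involve data reached through the other.&lt;br /&gt;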
&lt;br /&gt;
This table summarizes the characteristics of the benchmark applications used for the measurements. In the Synch column, B represents barriers and  L locks.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Application&lt;br /&gt;
! Input data set&lt;br /&gt;
! Shared memory&lt;br /&gt;
!Sharing granularity&lt;br /&gt;
! Synch&lt;br /&gt;
!Allocation pattern&lt;br /&gt;
|-&lt;br /&gt;
| Water-nsq|| 8000 molecules|| 5.35MB|| a molecule (672B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Water-sp|| 8000 molecules|| 10.15MB|| a molecule (680B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| LU|| 3072 × 3072|| 72.10MB|| block (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| FFT|| 2^20 numbers|| 48.25MB|| a row segment|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| TSP|| A graph of 32 cities|| 27.86MB|| a tour (276B)|| L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| SOR|| 2066 × 10240|| 80.73MB|| a row (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Barnes-sp|| 32768 bodies|| 41.21MB|| body fields (4-32B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Radix|| 10240000 keys|| 82.73MB|| an integer (4B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Volrend|| a file -head.den-|| 29.34MB|| a 4 × 4 box (4B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Ocean|| a 514 × 514 grid|| 94.75MB|| grid point (8B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| NBody|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|-&lt;br /&gt;
| NBodyW|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The authors found that the average speedup of the HLRC protocol is 5.97, and the average speedup of the SC(with MV) protocol is 4.5 if non-optimized allocation is used, and 6.2 if the granularity is changed dynamically. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2008, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.1319 Roy and Chaudhary] compared the communication requirements of three different page-based DSM systems ([http://www.cs.umd.edu/projects/cvm/ CVM], [http://www.cs.utah.edu/flux/quarks.html Quarks], and [http://ieeexplore.ieee.org/iel4/5737/15339/00709960.pdf?arnumber=709960 Strings]) that use virtual memory mechanisms to trap accesses to shared areas. Their study was also based on the SPLASH-2 suite of programs.&lt;br /&gt;
In their experiments, Quarks was running tasks on separate processes (it does not support more than one application thread per process) while CVM and Strings were run with multiple application threads.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Program &lt;br /&gt;
!CVM &lt;br /&gt;
!Quarks &lt;br /&gt;
!Strings&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| FFT||1290||2419||1894&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-c||135||-||485&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-n||385||2873||407&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| OCEAN-c||1955||15475||6676&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-n2||2253||38438||10032&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-sp||905||7568||1998&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MATMULT||290||1307||645&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SOR||247||7236||934&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The results indicate that Quarks generates a large quantity of messages in comparison with the other two systems. This can be observed in the table above. LU-c (contiguous version of LU) does not have a result for Quarks, as it was not possible to obtain results for the number of tasks (16) used in the experiment. This was due to the excessive number of collisions. In general, due to the lock related traffic, the performance of Quarks is quite low when compared to the other two systems for many of the application programs tested.&lt;br /&gt;
CVM improves on lock management by allowing out-of-order access through a centralized lock manager. This makes the locking times for CVM much smaller than for the others. &lt;br /&gt;
Nevertheless, the best performer is Strings. It wins in overall performance over the other two systems compared, but there is room for improvement here too, as locking times in Strings were observed to account for a large percentage of the overall computation. &lt;br /&gt;
&lt;br /&gt;
The overall conclusion is that using multiple threads per node and optimized locking mechanisms (CVM-type) provides the best performance.&lt;br /&gt;
&lt;br /&gt;
=== Evolution ===&lt;br /&gt;
A more recent version of a distributed shared memory system is vNUMA.   [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html Chapman and Heiser] describe vNUMA (where v is for virtual and NUMA is for [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non Uniform Memory Access]) as &amp;quot;a virtual machine that presents a cluster as a virtual shared-memory multiprocessor.&amp;quot; The virtualization in vNUMA is a layer between the real CPUs that form the distributed shared memory system and the OS that runs on top of it, usually Linux. The DSM in vNUMA is part of the hypervisor, which is the part of a virtual system that maps guest physical addresses to real physical ones. Guest virtual addresses are in turn mapped to guest physical addresses by the guest OS. &lt;br /&gt;
&lt;br /&gt;
The difference from other virtual machines is that vNUMA runs one OS on top of several physical machines, whereas virtual machines often run several guest OSes on top of one host OS that runs on one machine. It also uses DSM software techniques to present all the virtualized memory as a whole.&lt;br /&gt;
&lt;br /&gt;
The DSM in vNUMA is a single-writer/multiple-reader write-invalidate protocol with sequential consistency. It is based on the IVY DSM, but it introduces several improvements to increase performance. The owner of a page can determine whether the page needs to be sent by looking at the copyset (the set of nodes that hold a copy of the page), which avoids several page faults, and the manager becomes the owner of the copyset as soon as it is part of it. Two other improvements are ''incremental deterministic merging'' and ''write-update-plus (WU+)''. ''Incremental deterministic merging'' uses sequence numbers to ensure that a location is updated with the latest value and not with intermediate, out-of-order writes. ''Write-update-plus (WU+)'' enforces single-writer for pages where atomic operations are done. vNUMA dynamically changes from multiple-writer to single-writer when atomic operations are detected. &lt;br /&gt;
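&lt;br /&gt;
The role of sequence numbers can be illustrated in isolation.  The sketch below is not vNUMA's merging algorithm; it only shows, under the assumption&lt;br /&gt;
that every update carries a monotonically increasing sequence number, how a location can discard writes that arrive late.  The ''Location'' class is a hypothetical name.&lt;br /&gt;

```python
class Location:
    """One memory word that only accepts writes newer than what it holds."""

    def __init__(self):
        self.value = None
        self.sequence = -1

    def apply(self, sequence, value):
        # Drop stale or duplicate updates that arrive after a newer one.
        if sequence > self.sequence:
            self.sequence = sequence
            self.value = value
            return True
        return False


word = Location()
# Updates (sequence number, value) delivered out of order by the network.
arrivals = [(0, "a"), (2, "c"), (1, "b"), (3, "d"), (2, "c")]
applied = [word.apply(seq, val) for seq, val in arrivals]
```

The late update with sequence number 1 and the duplicate with sequence number 2 are ignored, so the word ends up holding the value of the newest write regardless of arrival order.&lt;br /&gt;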
&lt;br /&gt;
In [http://communities.vmware.com/community/vmtn/cto/high-performance/blog/2011/09/19/vnuma-what-it-is-and-why-it-matters vNUMA what it is and why it matters], VMware presents vNUMA as part of vSphere, a virtualization platform oriented to build cloud computing frameworks.&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
*[http://www.sgi.com/pdfs/4250.pdf Performance and Productivity Breakthroughs with Very Large Coherent Shared Memory: The SGI® UV Architecture] SGI white paper.&lt;br /&gt;
*[http://www.scalemp.com/architecture#2 Versatile SMP (vSMP) Architecture]&lt;br /&gt;
*Adrian Moga, Michel Dubois, [http://www.sciencedirect.com/science/article/pii/S1383762108001136 &amp;quot;A comparative evaluation of hybrid distributed shared-memory systems,&amp;quot;] Journal of Systems Architecture, Volume 55, Issue 1, January 2009, Pages 43-52&lt;br /&gt;
*Jinbing Peng; Xiang Long; Limin Xiao; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=4708970&amp;amp;isnumber=4708921 &amp;quot;DVMM: A Distributed VMM for Supporting Single System Image on Clusters,&amp;quot;] Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for , vol., no., pp.183-188, 18-21 Nov. 2008&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
*Shan, H.; Singh, J.P.; Oliker, L.; Biswas, R.; , [http://escholarship.org/uc/item/76p9b40g#page-1 &amp;quot;Message passing vs. shared address space on a cluster of SMPs,&amp;quot;] Parallel and Distributed Processing Symposium., Proceedings 15th International , vol., no., pp.8 pp., Apr 2001&lt;br /&gt;
*Protic, J.; Tomasevic, M.; Milutinovic, V.; , [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 &amp;quot;Distributed shared memory: concepts and systems,&amp;quot;] Parallel &amp;amp; Distributed Technology: Systems &amp;amp; Applications, IEEE , vol.4, no.2, pp.63-71, Summer 1996&lt;br /&gt;
*Chandola, V. , [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.5206&amp;amp;rep=rep1&amp;amp;type=pdf &amp;quot;Design Issues in Implementation of Distributed Shared Memory in User Space,&amp;quot;]&lt;br /&gt;
*Nitzberg, B.; Lo, V. , [http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf &amp;quot;Distributed Shared Memory:  A Survey of Issues and Algorithms&amp;quot;]&lt;br /&gt;
*Steven Cameron Woo , Moriyoshi Ohara , Evan Torrie , Jaswinder Pal Singh , Anoop Gupta, [http://dl.acm.org/citation.cfm?id=223990 &amp;quot;The SPLASH-2 programs: characterization and methodological considerations,&amp;quot;] Proceedings of the 22nd annual international symposium on Computer architecture, p.24-36, June 22-24, 1995, S. Margherita Ligure, Italy&lt;br /&gt;
*Jegou, Y. , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Implementation of page management in Mome, a user-level DSM] Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on, p.479-486, 21 May 2003, IRISA/INRIA, France&lt;br /&gt;
*Hennessy, J.; Heinrich, M.; Gupta, A.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=747863 &amp;quot;Cache-Coherent Distributed Shared Memory:  Perspectives on Its Development and Future Challenges,&amp;quot;] Proceedings of the IEEE, Volume: 87 Issue:3, pp.418 - 429, Mar 1999, Comput. Syst. Lab., Stanford Univ., CA&lt;br /&gt;
*J. Protic, M. Tomasevic, and V. Milutinovic, [http://media.wiley.com/product_data/excerpt/76/08186773/0818677376-2.pdf An Overview of Distributed Shared Memory]&lt;br /&gt;
*Yoon, M.; Malek, M.; , [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf &amp;quot;Configurable Shared Virtual Memory for Parallel Computing&amp;quot;] University of Texas Technical Report tr94-21, July 15 1994, Department of Electrical and Computer Engineering, The University of Texas at Austin&lt;br /&gt;
*Dubnicki, C.;   Iftode, L.;   Felten, E.W.;   Kai Li;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=508084 &amp;quot;Software Support for Virtual Memory-Mapped Communication&amp;quot;] Parallel Processing Symposium, 1996., Proceedings of IPPS '96, The 10th International, pp.372 - 381, 15-19 Apr 1996, Dept. of Comput. Sci., Princeton Univ., NJ &lt;br /&gt;
*Dubnicki, C.;   Bilas, A.;   Li, K.;   Philbin, J.;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=580931 &amp;quot;Design and Implementation of Virtual Memory-Mapped Communication on Myrinet&amp;quot;] Parallel Processing Symposium, 1997. Proceedings., 11th International, pp.388 - 396, 1-5 Apr 1997, Princeton Univ., NJ&lt;br /&gt;
*Kranz, D.; Johnson, K.; Agarwal, A.; Kubiatowicz, J.; Lim, B.; , [http://delivery.acm.org.prox.lib.ncsu.edu/10.1145/160000/155338/p54-kranz.pdf?ip=152.1.24.251&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=81831968&amp;amp;CFTOKEN=62928147&amp;amp;__acm__=1327853249_029db7e958cb50bd47056f939f3296f7 &amp;quot;Integrating Message-Passing and Shared-Memory:  Early Experience&amp;quot;] PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp.54 - 63&lt;br /&gt;
*Amza, C.;  Cox, A.L.;  Dwarkadas, S.;  Keleher, P.;  Honghui Lu;  Rajamony, R.;  Weimin Yu;  Zwaenepoel, W.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 &amp;quot;TreadMarks: shared memory computing on networks of workstations&amp;quot;] Computer, Volume: 29 Issue:2, pp. 18 - 28, Feb 1996, Dept. of Comput. Sci., Rice Univ., Houston, TX&lt;br /&gt;
*David R. Cheriton. 1985. [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems.] SIGOPS Oper. Syst. Rev. 19, 4 (October 1985), 26-33.&lt;br /&gt;
*Matthew Chapman and Gernot Heiser. 2009. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html vNUMA: a virtual shared-memory multiprocessor.] In Proceedings of the 2009 conference on USENIX Annual technical conference (USENIX'09). USENIX Association, Berkeley, CA, USA, 2-2.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
The memory hierarchy described for the CSVS system places remote memories:&lt;br /&gt;
# Between main memory and local disk storage&lt;br /&gt;
# At the same level as local disk storage&lt;br /&gt;
# Below local disk storage&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When the OS sends a message to retrieve non-local data, the virtual address of the requested data is translated to a physical address:&lt;br /&gt;
# At the origin of the message, i.e. where the page fault occurs&lt;br /&gt;
# By the DSM system default manager&lt;br /&gt;
# At the location where the desired page resides&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
DSM nodes&lt;br /&gt;
# partially map variable amounts of their memory to the distributed address space&lt;br /&gt;
# are configured to supply a contiguous and fixed amount of memory to the distributed address space&lt;br /&gt;
# utilize I/O to access the entirely non-local distributed address space&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SAS programming model:&lt;br /&gt;
# Has evolved beyond MP as it is difficult to program in scalable DSM environments&lt;br /&gt;
# Utilizes MP to communicate but relies on the ease of a common address space&lt;br /&gt;
# Has suffered too many security problems; scalable MP now dominates the landscape&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Page management in MOME:&lt;br /&gt;
# Requires consistent address space mapping across all nodes&lt;br /&gt;
# Is managed from a global DSM perspective&lt;br /&gt;
# Allows an F and V page descriptor to occur for the same page on the same node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The most adopted DSM algorithm is:&lt;br /&gt;
# Single Reader/ Single Writer (SRSW)&lt;br /&gt;
# Multiple Readers/ Single Writer (MRSW)&lt;br /&gt;
# Multiple Readers/Multiple Writers (MRMW)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Sequential Consistency:&lt;br /&gt;
# all processors see the same ordering of memory references, and these are issued in sequences by the individual processors&lt;br /&gt;
# the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence&lt;br /&gt;
# consistency is required only on synchronization accesses&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
SPLASH is a&lt;br /&gt;
# coherence protocol&lt;br /&gt;
# collection of parallel programs engineered for the evaluation of shared address space machines&lt;br /&gt;
# DSM implementation&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The complexity of development is &lt;br /&gt;
# the same for MP and SAS&lt;br /&gt;
# lower for SAS&lt;br /&gt;
# lower for MP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
vNUMA is a&lt;br /&gt;
# Fast Fourier Transform implementation&lt;br /&gt;
# network implementation&lt;br /&gt;
# virtual machine that presents a cluster as a virtual shared-memory multiprocessor&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72699</id>
		<title>CSC/ECE 506 Spring 2013/4a ss</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72699"/>
		<updated>2013-02-13T22:09:58Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Dsm.jpg|300px|thumb|left|SCD's IBM SP system blackforest, a distributed shared memory ('''DSM''') system]]&lt;br /&gt;
&lt;br /&gt;
== SAS programming on distributed-memory machines ==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Shared_memory '''Shared Address Space'''] (SAS) programming on distributed-memory machines is a programming abstraction that requires less development effort than the traditional method of [http://en.wikipedia.org/wiki/Message_passing '''Message Passing'''] (MP) on distributed-memory machines, such as clusters of servers.  Distributed systems are groups of computers that communicate through a network and share a common work goal.  They typically do not physically share the same memory (they are not [http://en.wikipedia.org/wiki/Coupling_%28computer_programming%29 '''tightly coupled''']); instead, each processor or group of processors must depend on mechanisms other than direct memory access in order to communicate.  Relevant issues include [http://en.wikipedia.org/wiki/Memory_coherence '''memory coherence'''], types of memory access, data and process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''], and performance.&lt;br /&gt;
&lt;br /&gt;
=== Origins ===&lt;br /&gt;
Distributed memory systems started to flourish in the 1980s. Increasing processor performance and network connectivity offered a suitable environment for parallel processing over a network of computers, which was a cheap way to assemble massive computing power. The main drawback was the move from sequential programs written for local memory to parallel programs that share memory across machines. SAS simplified programming by hiding the mechanisms used to access distant memory located in other computers of the cluster.&lt;br /&gt;
&lt;br /&gt;
In 1985, Cheriton, in his article [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 &amp;quot;Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems&amp;quot;], introduced ideas for applying shared memory techniques in distributed memory systems. Cheriton envisioned a system of nodes with a pool of shared memory and a common file namespace that could &amp;quot;decentralize the implementation of a service.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:Dsmd.jpg|400px|thumb|right|Distributed Shared Memory]]&lt;br /&gt;
Early distributed computer systems relied almost exclusively on message passing in order to communicate with one another, and this technique is still widely used today.  In a message passing model, each processor's local memory can be considered as isolated from that of the rest of the system.  Processes or objects can send or receive messages in order to communicate, and this can occur in a synchronous or asynchronous manner.  In distributed systems, and particularly with certain types of programs, the message passing model can become overly burdensome to the programmer, as tracking data movement and maintaining data integrity can become quite challenging with many control threads.  A shared address or shared-memory system, however, can provide a programming model that simplifies data sharing via uniform mechanisms of data structure reads and writes on common memory.  Current distributed systems seek to take advantage of both SAS and MP programming model principles in hybrid systems.&lt;br /&gt;
&lt;br /&gt;
=== Distributed Shared Memory (DSM) ===&lt;br /&gt;
Distributed shared memory is an architectural approach designed to overcome the scaling limitations of symmetric shared memory multiprocessors while retaining a shared memory model for communication and programming. Generally, a distributed system consists of a set of nodes connected by an interconnection network.  Nodes may consist of individual processors or a multiprocessor system (e.g., a [http://en.wikipedia.org/wiki/Symmetric_multiprocessing '''Symmetric Multiprocessor'''] (SMP)), the latter typically sharing a system bus.  Each node contains a local memory, which maps partially to the distributed address space.&lt;br /&gt;
Regardless of the system topology (bus, ring, mesh), a specific interconnection in each cluster must connect it to the system. Information about the states and current locations of particular data blocks usually resides in a system table or directory. Directory organization varies from full-map storage to different dynamic organizations.&lt;br /&gt;
Three issues arise when accessing data in the DSM address space while keeping that data consistent:&lt;br /&gt;
# '''DSM algorithm''' - which algorithm to use to access data. Commonly used strategies are replication and migration. Replication allows multiple copies of the same data item to reside in different local memories. Migration implies that only a single copy of a data item exists at any one time, so the data item must be moved to the requesting site for exclusive use.&lt;br /&gt;
# '''Implementation level''' - a lookup must first determine whether the requested data is in local memory; if it is not, the system must bring the data into local memory. This can be done in software, in hardware, or in both. The choice of implementation depends on price/performance trade-offs.&lt;br /&gt;
# '''Memory consistency model''' - the behavior of memory with respect to read and write operations from multiple processors must be governed by an appropriate memory consistency model.&lt;br /&gt;
&lt;br /&gt;
=== Cache-Coherent DSM ===&lt;br /&gt;
&lt;br /&gt;
The cache coherence problem arises when different processors cache and update values of the same memory location. Early DSM systems implemented a shared address space in which the time required to access a piece of data&lt;br /&gt;
depended on its location.  These systems became known as [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access '''Non-Uniform Memory Access'''] (NUMA), whereas an SMP-type&lt;br /&gt;
system is known as a [http://en.wikipedia.org/wiki/Uniform_Memory_Access '''Uniform Memory Access'''] (UMA) architecture.  NUMA architectures were difficult to program&lt;br /&gt;
because of potentially significant differences in data access times. SMP architectures dealt with this problem through caching.&lt;br /&gt;
Protocols were established to ensure that, before a location is written, all other copies of the location (in other caches)&lt;br /&gt;
are invalidated. Thus, the system allows multiple copies of a memory location to exist while&lt;br /&gt;
it is being read, but only one copy while it is being written.&lt;br /&gt;
&lt;br /&gt;
Cache-coherent DSM architectures rely on directory-based [http://en.wikipedia.org/wiki/Cache_coherency '''cache coherence'''], in which an extra directory structure keeps track&lt;br /&gt;
of all blocks that have been cached by each processor. Since the directory tracks which caches have copies&lt;br /&gt;
of any given memory block, a coherence protocol can use&lt;br /&gt;
the directory, together with state and other information about each cached block, to maintain a consistent view of&lt;br /&gt;
memory. A simple cache coherence&lt;br /&gt;
protocol can operate with three states for each cache block: Invalid,&lt;br /&gt;
Shared, and Exclusive.  Furthermore, in a cache-coherent DSM machine, the directory is distributed across the nodes so that each entry resides&lt;br /&gt;
with the memory block it describes in that node's local physical memory.&lt;br /&gt;
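&lt;br /&gt;
The three-state scheme described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a single directory object stands in for the distributed directory, and the names ''Directory'', ''read'' and ''write'' are hypothetical rather than taken from any real system.&lt;br /&gt;
&lt;br /&gt;
```python
# Minimal sketch of a directory-based coherence protocol with three states.
# All class and method names are hypothetical.

INVALID, SHARED, EXCLUSIVE = 'Invalid', 'Shared', 'Exclusive'


class Directory:
    # Tracks, for each memory block, which node caches hold a copy.

    def __init__(self):
        self.sharers = {}   # block -> set of node ids holding a Shared copy
        self.owner = {}     # block -> node id holding the Exclusive copy

    def state(self, node, block):
        # State of one block as seen by one node's cache.
        if self.owner.get(block) == node:
            return EXCLUSIVE
        if node in self.sharers.get(block, set()):
            return SHARED
        return INVALID

    def read(self, node, block):
        # A read downgrades any other exclusive owner and adds a sharer.
        if self.owner.get(block) == node:
            return                          # already holds the only copy
        holders = self.sharers.setdefault(block, set())
        previous_owner = self.owner.pop(block, None)
        if previous_owner is not None:
            holders.add(previous_owner)     # old owner keeps a Shared copy
        holders.add(node)

    def write(self, node, block):
        # A write invalidates every other copy before it proceeds.
        holders = self.sharers.setdefault(block, set())
        invalidated = set(holders)
        previous_owner = self.owner.get(block)
        if previous_owner is not None:
            invalidated.add(previous_owner)
        invalidated.discard(node)
        holders.clear()
        self.owner[block] = node
        return invalidated                  # nodes that were sent an invalidation


directory = Directory()
directory.read(0, block=7)
directory.read(1, block=7)
invalidated = directory.write(2, block=7)
print(sorted(invalidated))
print([directory.state(n, 7) for n in range(3)])
```
&lt;br /&gt;
In this sketch the write by node 2 invalidates the copies held by nodes 0 and 1, leaving node 2 as the only holder, in the Exclusive state.&lt;br /&gt;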
&lt;br /&gt;
=== Page Management and memory mapping in Mome ===&lt;br /&gt;
[[File:Untitled_Project.jpg|350px|thumb|left|Memory Mapping in Mome]]&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Mome] is described by its developers as a user-level distributed shared memory.  As of 2003, Mome was a run-time model that mapped Mome segments onto each node's private address space.&lt;br /&gt;
&lt;br /&gt;
===== Mome Segment creation =====&lt;br /&gt;
&lt;br /&gt;
Segment creation was initiated through a ''MomeCreateSegment(size)'' call, which returned an identifier for mapping used by all nodes.  Any process can request a mapping of a section of its local memory to a section of a Mome segment by calling ''MomeMap(Addr, Lg, Prot, Flags, Seg, Offset)'', which returns the starting address of the mapped region.  Each mapping request made by a process is independent, and the addresses of the mappings may or may not be consistent across nodes.  If mappings are consistent between processes, however, then pointers may be shared by them.  Mome supports strong and weak consistency models, and for any particular page each node can dynamically manage its consistency during program execution.&lt;br /&gt;
&lt;br /&gt;
===== Page Management in Mome =====&lt;br /&gt;
&lt;br /&gt;
Mome manages [http://en.wikipedia.org/wiki/Page_%28computer_memory%29 '''pages'''] in a directory-based scheme in which each page directory maintains the status of six characteristics per page on each node.  The page manager acts upon collections of nodes according to these characteristics for each page:  &lt;br /&gt;
V nodes possess the current version, M nodes have a modified version, S nodes want strong consistency, I nodes are &lt;br /&gt;
invalidated, F nodes have initiated a modification merge, and H nodes have a special type of hidden page.  A new version of &lt;br /&gt;
a page is created prior to a constraint violation and before modifications are integrated as a result of a consistency&lt;br /&gt;
request.  &lt;br /&gt;
&lt;br /&gt;
===== Memory mapping in Mome =====&lt;br /&gt;
&lt;br /&gt;
The Mome memory mapping figure to the left shows a possible DSM memory organization on a single node.  The DSM memory size&lt;br /&gt;
shown is 22 pages.  When a new segment is created on a node a segment descriptor is created on that node.  In this case the&lt;br /&gt;
segment descriptor is 12 pages, with each segment descriptor block corresponding to one page.  Each block also contains&lt;br /&gt;
three DSM memory references for current, modified and next version of pages.  The memory organization state shows an &lt;br /&gt;
application with two mappings, M1 and M2, with segment offsets at 0 and 8.  The six pages of M1 are managed by segment &lt;br /&gt;
descriptor blocks 0 to 5.  The descriptor blocks (and application memory) show that pages 1,2 and 5 have no associated &lt;br /&gt;
memory, while M1 page 0 is mapped to block 6 as a current version and M1 page 3 is mapped to block 13 as a current version,&lt;br /&gt;
block 8 as a modified version, and has initiated a modifications merge as indicated by the block 17 pointer.  The communication&lt;br /&gt;
layer manages incoming messages from other nodes. &lt;br /&gt;
&lt;br /&gt;
[[File:Mem hierarchy.png|200px|thumb|right|Memory hierarchy of node]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Node Communication ===&lt;br /&gt;
As described by [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf Yoon, et al.] in 1994, the communication paradigm &lt;br /&gt;
for their DSM node network relies on a memory hierarchy for each node that places remote memories at the same level as &lt;br /&gt;
its own local disk storage.  [http://en.wikipedia.org/wiki/Page_fault '''Page faults'''] within a given node that can be resolved within disk storage are handled &lt;br /&gt;
normally, while those that cannot are resolved between the node's main memory and the memory of other nodes. Point-to-point &lt;br /&gt;
communication at the node level is supported through message passing, and the specific mechanism for communication is&lt;br /&gt;
agreed to by all nodes. &lt;br /&gt;
&lt;br /&gt;
Yoon describes a DSM system that generates a shared virtual memory on a per-job basis.  A '''configurable shared virtual address space'''&lt;br /&gt;
(CSVS) is prepared when a member node receives a job, generates a job identification number, and creates&lt;br /&gt;
an information table in its memory:&lt;br /&gt;
&lt;br /&gt;
                                     ''JOB_INFORMATION {''&lt;br /&gt;
                                         ''status;''&lt;br /&gt;
                                         ''number_of_tasks;''&lt;br /&gt;
                                         ''number_of_completed_tasks;''&lt;br /&gt;
                                         ''*member_list;''                    /*pointer to first member*/&lt;br /&gt;
                                         ''number_of_members;''&lt;br /&gt;
                                         ''IO_server;''&lt;br /&gt;
                                     ''}''&lt;br /&gt;
&lt;br /&gt;
The ''status'' refers to the creation of the CSVS, and ''number_of_members'' and ''member_list'' are established through&lt;br /&gt;
a task distribution process during address space assignment.  All tasks associated with the program are tagged with&lt;br /&gt;
the ''job_id'' and ''requester_id'' and, following address space assignment, are distributed across the system.  The&lt;br /&gt;
actual CSVS creation occurs when the first task of a job is initiated by a member, which asks all other members to generate the new&lt;br /&gt;
CSVS.  Subspace assignment for the SAS model ensues under the specific ''job_id''.&lt;br /&gt;
&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Operating_system '''operating system'''] (OS) or [http://en.wikipedia.org/wiki/Memory_management_unit '''memory management unit'''] (MMU) of each member maintains a copy of the ''JOB_INFORMATION'' &lt;br /&gt;
table which is consulted to identify the default manager when a page fault occurs.  When a page fault does occur, the MMU&lt;br /&gt;
locates the default manager and handles the fault normally.  If the requested page is outside its subspace, then the &lt;br /&gt;
virtual address, ''job_id'', and default manager identification are sent to the [http://en.wikipedia.org/wiki/Control_unit '''control unit'''] (CU) to construct a &lt;br /&gt;
message requesting a page copy.  All messages sent through the CSVS must include a virtual address and the ''job_id'',&lt;br /&gt;
which acts as protection to control access to relevant memory locations.  When received at the appropriate member&lt;br /&gt;
node, the virtual address is translated to a local physical address.&lt;br /&gt;
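&lt;br /&gt;
The request path described above can be sketched as follows. This is a rough illustration, not the mechanism from the technical report: the page size, the round-robin assignment of subspaces to members, the dictionary-based page table, and the names ''JobInformation'', ''default_manager'', ''build_page_request'' and ''MemberNode'' are all assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
from dataclasses import dataclass

PAGE_SIZE = 4096        # assumed page size, not given in the source
SUBSPACE_PAGES = 16     # assumed: each subspace covers 16 contiguous pages


@dataclass
class JobInformation:
    # Fields mirror the JOB_INFORMATION table; the types are assumptions.
    status: str
    number_of_tasks: int
    number_of_completed_tasks: int
    member_list: list
    io_server: int

    @property
    def number_of_members(self):
        return len(self.member_list)


def default_manager(job, virtual_address):
    # Assumed policy: subspaces are handed to members in round-robin order.
    page = virtual_address // PAGE_SIZE
    index = (page // SUBSPACE_PAGES) % job.number_of_members
    return job.member_list[index]


def build_page_request(job_id, virtual_address, manager):
    # Every CSVS message carries the virtual address and the job_id.
    return {'job_id': job_id, 'virtual_address': virtual_address,
            'manager': manager}


class MemberNode:
    def __init__(self, node_id, job_id):
        self.node_id = node_id
        self.job_id = job_id
        self.page_table = {}    # virtual page number -> local physical frame

    def handle_request(self, message):
        # The job_id acts as a protection check on the receiving node.
        if message['job_id'] != self.job_id:
            raise PermissionError('job_id mismatch')
        page, offset = divmod(message['virtual_address'], PAGE_SIZE)
        frame = self.page_table[page]
        # Translation to a physical address happens at the receiver.
        return frame * PAGE_SIZE + offset


job = JobInformation('created', 4, 0, member_list=[10, 11, 12], io_server=10)
address = 17 * PAGE_SIZE + 100      # page 17 falls in the second subspace
manager = default_manager(job, address)
owner = MemberNode(node_id=manager, job_id=42)
owner.page_table[17] = 3
request = build_page_request(42, address, manager)
print(manager, owner.handle_request(request))
```
&lt;br /&gt;
The point of the sketch is the division of labor: the faulting node only identifies the default manager and sends the virtual address with the ''job_id'', while the node that holds the page checks the ''job_id'' and performs the translation to a local physical address.&lt;br /&gt;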
[[File:Jacobi_code.jpg|300px|thumb|left|Jacobi method pseudocode using TreadMarks API]]&lt;br /&gt;
&lt;br /&gt;
=====Improvements in communication=====&lt;br /&gt;
Early SAS programming models in DSM environments suffered from poor performance because protection schemes required&lt;br /&gt;
applications to access the network via system calls, significantly increasing latency.  Later software&lt;br /&gt;
systems and network interfaces were able to ensure safety without incurring the time cost of the system calls.  Addressing this and other &lt;br /&gt;
latency sources on both ends of communication was an important goal for projects such as the&lt;br /&gt;
'''Virtual memory-mapped communication''' (VMMC) model that was developed as part of the [http://shrimp.cs.princeton.edu/index.html Shrimp Project]. &lt;br /&gt;
&lt;br /&gt;
Protection is achieved in VMMC because the receiver must grant permission before the sender is allowed to transfer data&lt;br /&gt;
to a receiver-defined area of its address space.  In this communication scheme, the receiver process exports areas of its&lt;br /&gt;
address space that will act as receive buffers, and sending processes must import the destinations.  There is no explicit &lt;br /&gt;
receive operation in VMMC.  Receivers can&lt;br /&gt;
define which senders may import specific buffers, and VMMC ensures that only receive buffer space is overwritten.  Imported&lt;br /&gt;
receive buffers are mapped to a destination proxy space, which can be implemented as part of the sender's virtual address&lt;br /&gt;
space and can be translated by VMMC to a receiver node, process, and memory address.  VMMC supports a deliberate update&lt;br /&gt;
request, which updates data sent previously to an imported receive buffer.  This transfer occurs directly, without interrupting the receiver's&lt;br /&gt;
CPU.&lt;br /&gt;
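&lt;br /&gt;
The export/import protection model can be illustrated with a small Python sketch. It models only the permission check and the direct write into the receive buffer; the class and method names (''Receiver'', ''Sender'', ''export'', ''import_buffer'', ''send'') are hypothetical and are not the VMMC interface.&lt;br /&gt;
&lt;br /&gt;
```python
class Receiver:
    def __init__(self, name):
        self.name = name
        self.memory = {}        # buffer name -> bytearray (receive buffer)
        self.allowed = {}       # buffer name -> set of permitted sender names

    def export(self, buffer_name, size, senders):
        # Export part of the address space as a receive buffer.
        self.memory[buffer_name] = bytearray(size)
        self.allowed[buffer_name] = set(senders)


class Sender:
    def __init__(self, name):
        self.name = name
        self.proxies = {}       # proxy id -> (receiver, buffer name)

    def import_buffer(self, receiver, buffer_name):
        # Importing fails unless the receiver granted permission.
        if self.name not in receiver.allowed.get(buffer_name, set()):
            raise PermissionError('import not permitted')
        proxy = len(self.proxies)
        self.proxies[proxy] = (receiver, buffer_name)
        return proxy

    def send(self, proxy, offset, data):
        # Write directly into the receive buffer; the receiver runs no code.
        receiver, buffer_name = self.proxies[proxy]
        buffer = receiver.memory[buffer_name]
        valid_offsets = range(len(buffer) - len(data) + 1)
        if offset not in valid_offsets:
            raise ValueError('transfer would fall outside the receive buffer')
        buffer[offset:offset + len(data)] = data


receiver = Receiver('B')
receiver.export('inbox', size=8, senders=['A'])
sender = Sender('A')
proxy = sender.import_buffer(receiver, 'inbox')
sender.send(proxy, 2, b'hi')
print(bytes(receiver.memory['inbox']))
```
&lt;br /&gt;
After the call to ''send'', bytes 2 and 3 of the exported buffer hold the transferred data, although the receiver never executed a receive operation. A sender that was not named in ''export'' cannot obtain a proxy, and a transfer that would overrun the buffer is rejected.&lt;br /&gt;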
&lt;br /&gt;
[[File:Shortest_path_pseudocode.jpg|300px|thumb|right|Shortest path pseudocode using TreadMarks API]]&lt;br /&gt;
=== Programming Environment ===&lt;br /&gt;
The globally shared memory abstraction provided through virtual memory or some other DSM mechanism allows programmers &lt;br /&gt;
to focus on algorithms instead of processor communication and data tracking.  Many programming environments have been&lt;br /&gt;
developed for DSM systems including Rice University's [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]&lt;br /&gt;
in the 1990s.  TreadMarks was a user-level library that ran on top of Unix.  Programs were written in&lt;br /&gt;
C, C++ or Fortran and then compiled and linked with the TreadMarks library.  &lt;br /&gt;
&lt;br /&gt;
Shown at left is a pseudocode example of using the TreadMarks API to implement the Jacobi method, an iterative method used to solve partial &lt;br /&gt;
differential equations.  The code iterates over a 2D array and updates each element to the average of its four&lt;br /&gt;
nearest neighbors.  All processors are assigned an approximately equal number of rows, and neighboring processes &lt;br /&gt;
share boundary rows as necessary for the calculation.  This example shows TreadMarks' use of [http://en.wikipedia.org/wiki/Barrier_%28computer_science%29 '''barriers'''], a technique used for process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''].  Barriers prevent race&lt;br /&gt;
conditions.  ''void Tmk_startup(int argc, char **argv)'' initializes TreadMarks and starts the remote processes.  &lt;br /&gt;
The ''void Tmk_barrier(unsigned id)'' call blocks the calling process until every other process arrives at the barrier.  In this&lt;br /&gt;
example, ''Tmk_barrier(0)'' guarantees that process 0 completes initialization before any process proceeds, ''Tmk_barrier(1)'' &lt;br /&gt;
guarantees all previous iteration values are read before any current iteration values are written, and ''Tmk_barrier(2)''&lt;br /&gt;
guarantees all current iteration values are written before any next iteration computation begins.&lt;br /&gt;
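&lt;br /&gt;
The same barrier structure can be tried out on a single machine. The sketch below is not TreadMarks code: it uses Python threads and ''threading.Barrier'' from the standard library in place of ''Tmk_barrier'', and the grid size, worker count, iteration count and boundary values are arbitrary choices made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import threading

ROWS, COLS, WORKERS, ITERATIONS = 8, 8, 2, 10

grid = [[0.0] * COLS for _ in range(ROWS)]
scratch = [[0.0] * COLS for _ in range(ROWS)]
barrier = threading.Barrier(WORKERS)


def initialize():
    # Fixed boundary condition: top edge held at 1.0, all else 0.0.
    for col in range(COLS):
        grid[0][col] = 1.0


def worker(rank):
    if rank == 0:
        initialize()
    barrier.wait()              # plays the role of Tmk_barrier(0)
    # Interior rows are split into contiguous bands, one per worker.
    # This sketch assumes the interior rows divide evenly among workers.
    interior = list(range(1, ROWS - 1))
    per_worker = len(interior) // WORKERS
    band = interior[rank * per_worker:(rank + 1) * per_worker]
    for _ in range(ITERATIONS):
        for row in band:
            for col in range(1, COLS - 1):
                scratch[row][col] = (grid[row - 1][col] + grid[row + 1][col]
                                     + grid[row][col - 1] + grid[row][col + 1]) / 4
        barrier.wait()          # like Tmk_barrier(1): all old values are read
        for row in band:
            for col in range(1, COLS - 1):
                grid[row][col] = scratch[row][col]
        barrier.wait()          # like Tmk_barrier(2): all new values are written


threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(WORKERS)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(grid[1][1])
```
&lt;br /&gt;
Removing either barrier inside the loop would let one worker overwrite a boundary row of ''grid'' while its neighbor is still reading the previous iteration's values, which is the race condition the barriers exist to prevent.&lt;br /&gt;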
&lt;br /&gt;
To the right is a short pseudocode program that illustrates another SAS synchronization technique, [http://en.wikipedia.org/wiki/Lock_%28computer_science%29 '''locks'''].  This program calculates the shortest route through a group of nodes that starts at a designated start node, visits each&lt;br /&gt;
other node once, and returns to the origin node.  The shortest route identified thus far is stored in the shared ''Shortest_length'',&lt;br /&gt;
and investigated routes are kept in a queue, most promising at the front, and expanded one node at a time.  A process&lt;br /&gt;
compares its resulting shortest partial path with ''Shortest_length'', updates it if necessary, and returns to the queue&lt;br /&gt;
to continue its search.  Process 0 allocates the shared queue and minimum length.  Exclusive access must be established&lt;br /&gt;
and maintained to ensure correctness, and this is achieved through locks on the queue and on ''Shortest_length''.  Each&lt;br /&gt;
process acquires the queue lock to identify a promising partial path and releases it upon finding one.  When &lt;br /&gt;
updating ''Shortest_length'', a lock is acquired to ensure [http://en.wikipedia.org/wiki/Mutual_exclusion '''mutual exclusion'''] on this shared data as well.&lt;br /&gt;
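&lt;br /&gt;
A single-machine sketch of the two locks is given below. It is not the TreadMarks program: Python threads and ''threading.Lock'' stand in for the DSM processes and lock calls, the distance matrix is made-up data, and for brevity each worker completes a partial route exhaustively instead of pruning it against the current minimum.&lt;br /&gt;
&lt;br /&gt;
```python
import heapq
import itertools
import threading

# Symmetric distances between four nodes (illustrative data only).
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
START = 0

queue = []                      # shared heap of (length, partial route)
queue_lock = threading.Lock()
best = {'length': float('inf'), 'route': None}
best_lock = threading.Lock()

# The role of process 0: set up the shared queue with one-hop routes.
for node in range(len(DIST)):
    if node != START:
        heapq.heappush(queue, (DIST[START][node], (START, node)))


def finish(length, route):
    # Complete a partial route in every possible way and record the best.
    remaining = [n for n in range(len(DIST)) if n not in route]
    for order in itertools.permutations(remaining):
        tail = (route[-1],) + order + (START,)
        total = length + sum(DIST[a][b] for a, b in zip(tail, tail[1:]))
        with best_lock:         # mutual exclusion on the shared minimum
            if min(total, best['length']) != best['length']:
                best['length'] = total
                best['route'] = route + order + (START,)


def worker():
    while True:
        with queue_lock:        # exclusive access to the shared queue
            if not queue:
                return
            length, route = heapq.heappop(queue)
        finish(length, route)


threads = [threading.Thread(target=worker) for _ in range(3)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(best['length'], best['route'])
```
&lt;br /&gt;
For this matrix the shortest round trip has length 18. Two routes achieve it, so which one is recorded depends on thread scheduling; the length itself does not, because every update of the shared minimum happens under ''best_lock''.&lt;br /&gt;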
&lt;br /&gt;
=== DSM Implementations ===&lt;br /&gt;
From an architectural point of view, DSMs are composed of several nodes connected via a network. Each node can be an individual machine or a cluster of machines. Each node has local memory modules that are partly or entirely part of the shared memory. Many characteristics can be used to classify DSM implementations. The most obvious is the level at which the DSM is implemented: software, hardware, or hybrid. This historical classification has been extracted from [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 Distributed shared memory: concepts and systems].&lt;br /&gt;
&lt;br /&gt;
Software DSM implementations are those implemented in user-level software, the OS, a programming language, or a combination of some or all of them.&lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Implementation'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Implementation / Cluster configuration'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Network'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Algorithm'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Consistency Model'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Granularity Unit'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Coherence Policy'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''SW/HW/Hybrid'''&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/vm-and-gc/ivy-shared-virtual-memory-li-icpp-1988.pdf IVY]||User-level library + OS modification || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||1 Kbyte ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/121133.121159 Munin]||Runtime system + linker + library + preprocessor + OS modifications ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||Type-specific (SRSW, MRSW, MRMW) ||Release ||Variable size objects ||Type-specific (delayed update, invalidate) ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]||User-level || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRMW ||Lazy release ||4 Kbytes ||Update, Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/74851.74871 Mirage]||OS kernel ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||MRSW ||Sequential ||512 bytes ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://onlinelibrary.wiley.com.prox.lib.ncsu.edu/doi/10.1002/spe.4380210503/pdf Clouds]||OS, out of kernel || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Inconsistent, sequential ||8 Kbytes ||Discard segment when unlocked ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1663305&amp;amp;tag=1 Linda]||Language || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||Variable (tuple size) ||Implementation-dependent ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=93183 Memnet]||Single processor, Memnet device||Token ring||MRSW||Sequential||32 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Scalable_Coherent_Interface SCI]||Arbitrary||Arbitrary||MRSW||Sequential||16 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.8026&amp;amp;rep=rep1&amp;amp;type=pdf KSR1]||64-bit custom PE, I+D caches, 32M local memory||Ring-based||MRSW||Sequential||128 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=766965&amp;amp;isnumber=16621 RMS]||1-4 processors, caches, 256M local memory||RM bus||MRMW||Processor||4 bytes||Update||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Alewife_(multiprocessor) Alewife]||Sparcle PE, 64K cache, 4M local memory, CMMU|| mesh||MRSW||Sequential||16 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://dl.acm.org/citation.cfm?id=192056 Flash]||MIPS T5, I +D  caches, Magic controller|| mesh||MRSW||Release||128 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s08/papers/10_tempest.pdf Typhoon]||SuperSparc, 2-L caches|| NP controller||MRSW||Custom||32 bytes||Invalidate custom||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://shrimp.cs.princeton.edu/ Shrimp]||16 Pentium PC nodes|| Intel Paragon routing network||MRMW||AURC, scope||4 Kbytes||Update/Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Below is an explanation of the main characteristics listed in the DSM classification.&lt;br /&gt;
 &lt;br /&gt;
There are three types of DSM algorithm: &lt;br /&gt;
* '''Single Reader/ Single Writer''' (SRSW) &lt;br /&gt;
** central server algorithm - produces long network delays &lt;br /&gt;
** migration algorithm - produces thrashing and false sharing&lt;br /&gt;
* '''Multiple Readers/ Single Writer''' (MRSW) - read replication algorithm. It uses write invalidate. MRSW is the most widely adopted algorithm.&lt;br /&gt;
* '''Multiple Readers/Multiple Writers''' (MRMW) - full replication algorithm.  It has full concurrency and uses atomic updates.&lt;br /&gt;
&lt;br /&gt;
The consistency model plays a fundamental role in DSM systems. Because of the nature of distributed systems, each consistency model constrains memory accesses differently. &amp;quot;A memory consistency model defines the legal ordering of memory references issued by some processor, as observed by other processors.&amp;quot; The stricter the consistency model, the longer the access times, but the simpler the programming. Some of the consistency model types are:&lt;br /&gt;
* Sequential consistency - all processors see the same ordering of memory references, and these are issued in sequence by the individual processors.&lt;br /&gt;
* Processor consistency - the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence. &lt;br /&gt;
* Weak consistency - consistency is required only on synchronization accesses.&lt;br /&gt;
* Release consistency - divides the synchronization accesses into acquire and release. Normal reads and writes for a particular node can be done only after all acquires for that node are finished. Similarly, releases can only be done after all writes and reads are finished. &lt;br /&gt;
* Lazy Release consistency - extends Release consistency by propagating modifications to the shared data only on the acquire, and of those, only the ones related to critical sections.&lt;br /&gt;
* Entry consistency - synchronization is performed at variable level. This increases programming labor but helps with lowering latency and traffic exchanges as only the specific variable needs to be synchronized.&lt;br /&gt;
&lt;br /&gt;
Granularity refers to the size of the data blocks managed by the coherence protocol. The unit differs between hardware and software systems: hardware systems tend to use smaller blocks than the virtual memory layer that manages the data in software systems. The problem with larger blocks is that the probability of contention is higher, even when the processors involved are not accessing the same piece of memory but only different parts of the same block. This is known as false sharing, and it causes thrashing (memory blocks keep being requested by processors, and processors keep waiting for the same memory blocks).&lt;br /&gt;
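&lt;br /&gt;
The effect of block size on false sharing can be illustrated with a minimal sketch. The helper names, the two addresses, and the block sizes (64 bytes and 4096 bytes) are assumptions chosen for illustration and do not come from any of the systems classified above.&lt;br /&gt;

```python
def block_of(address, block_size):
    """Return the index of the coherence block that contains a byte address."""
    return address // block_size


def falsely_shared(addr_a, addr_b, block_size):
    """Two distinct addresses falsely share when they fall in the same block."""
    same_block = block_of(addr_a, block_size) == block_of(addr_b, block_size)
    return addr_a != addr_b and same_block


# Two unrelated variables, 512 bytes apart (illustrative addresses).
x_addr, y_addr = 8192, 8704

# With a small, cache-line sized block the variables live in different blocks.
print(falsely_shared(x_addr, y_addr, 64))
# With a page sized block both variables land in the same block.
print(falsely_shared(x_addr, y_addr, 4096))
```

With the larger block, a write to one variable forces coherence traffic for a processor that only uses the other one, which is the situation described above.&lt;br /&gt;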
&lt;br /&gt;
Coherence policy regulates data replication. It dictates whether data that is written at one site should be invalidated or updated at the remote sites. Usually, systems with fine-grain coherence (byte/word) impose the update policy, whereas systems based on coarse-grain (page) coherence use the invalidate policy. Elsewhere in the literature this is also called the coherence protocol, and the two types are known as write-invalidate and write-update. The write-invalidate protocol invalidates all copies except one before writing to it. In contrast, write-update keeps all copies up to date.&lt;br /&gt;
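&lt;br /&gt;
The difference between the two policies can be sketched with a toy model of one replicated block. The class ReplicatedBlock, its method names, and the site labels are hypothetical; real DSM systems also track ownership, message ordering and failures, none of which is modeled here.&lt;br /&gt;

```python
class ReplicatedBlock:
    """One shared block replicated at several sites (hypothetical model)."""

    def __init__(self, sites, value=0):
        # Every site starts with a valid copy of the same value.
        self.copies = {site: value for site in sites}

    def write_invalidate(self, writer, value):
        # Drop every remote copy, then write the single remaining copy.
        self.copies = {writer: value}

    def write_update(self, writer, value):
        # Push the new value to every site that currently holds a copy.
        self.copies[writer] = value
        for site in self.copies:
            self.copies[site] = value

    def read(self, site, owner):
        # A site without a valid copy must fetch it from the owner first.
        if site not in self.copies:
            self.copies[site] = self.copies[owner]
        return self.copies[site]


block = ReplicatedBlock(["A", "B", "C"])
block.write_invalidate("A", 7)      # B and C lose their copies
print(sorted(block.copies))
print(block.read("B", owner="A"))   # B re-fetches the block from A
block.write_update("B", 9)          # A and B both receive the new value
print(block.copies)
```

In this sketch write-invalidate makes the next remote read pay for a fetch, while write-update pays on every write by touching all copies.&lt;br /&gt;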
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
There are numerous studies of the performance of shared memory applications in distributed systems. The vast majority of them use a collection of programs named [http://dl.acm.org/citation.cfm?id=223990 SPLASH and SPLASH-2.]&lt;br /&gt;
===== SPLASH and SPLASH-2 =====&lt;br /&gt;
The '''Stanford ParalleL Applications for Shared memory''' (SPLASH) is a collection of parallel programs engineered for the evaluation of shared address space machines. These programs have been used in research studies to measure and analyze different aspects of the DSM architectures that were emerging at the time. A subsequent suite of programs (SPLASH-2) grew out of the need to overcome the limitations of the SPLASH programs. SPLASH-2 covers a broader domain of scientific programs, makes use of improved algorithms, and pays more attention to the architecture of the underlying systems.&lt;br /&gt;
&lt;br /&gt;
Selected applications in the SPLASH-2 collections include:&lt;br /&gt;
*FFT: a '''Fast Fourier Transform''' implementation, in which the data is organized in source and destination matrices so that each processor stores a contiguous set of rows in its local memory. In this application all processors communicate with one another, sending data to each other to perform a matrix transposition.&lt;br /&gt;
*Ocean: simulates large-scale ocean movements based on eddy currents. Its computation shows nearest-neighbor access patterns on a multi-grid formation, as opposed to a single grid.&lt;br /&gt;
*LU: decomposes a matrix into the product of a lower triangular and an upper triangular matrix. LU exhibits &amp;quot;one-to-many non-personalized communication&amp;quot;.&lt;br /&gt;
*Barnes: simulates the interaction of a group of particles over time steps. &lt;br /&gt;
*Radix sort: an integer sorting algorithm. This implementation involves communication among all the processors, and the communication pattern is irregular.&lt;br /&gt;
&lt;br /&gt;
===== Case studies =====&lt;br /&gt;
In 2001, [http://escholarship.org/uc/item/76p9b40g#page-1 Shan et al.] presented a comparison of the performance and programming effort of MP versus SAS running on clusters of '''Symmetric Multiprocessors''' (SMPs). They highlighted the &amp;quot;automatic management and coherent replication&amp;quot; of the SAS programming model, which facilitates programming on these types of clusters. The study uses MPI/Pro for the MP programming model and the GeNIMA SVM protocol (a page-based shared virtual memory protocol) for SAS on a 32-processor system (a cluster of 8 machines, each a 4-way SMP). The subset of applications used includes regularly structured applications such as FFT, Ocean, and LU, contrasted with irregular ones such as Radix sort, among others.&lt;br /&gt;
&lt;br /&gt;
The complexity of development is represented by the number of lines of code per application, as shown in the table below. SAS complexity is significantly lower than that of MP, and the difference grows as applications become more irregular and dynamic in nature (the line count almost doubles for Radix).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Appl.&lt;br /&gt;
!FFT &lt;br /&gt;
!OCEAN &lt;br /&gt;
!LU &lt;br /&gt;
!RADIX &lt;br /&gt;
!SAMPLE &lt;br /&gt;
!N-BODY&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MPI ||222||4320||470||384||479||1371&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SAS ||210 ||2878 ||309||201||450||950&lt;br /&gt;
|}&lt;br /&gt;
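The size difference can be checked with a short calculation over the table. The line counts are copied from the table above; the variable names are only illustrative.&lt;br /&gt;

```python
# Lines of code per application, copied from the table above.
mpi_lines = {"FFT": 222, "OCEAN": 4320, "LU": 470,
             "RADIX": 384, "SAMPLE": 479, "N-BODY": 1371}
sas_lines = {"FFT": 210, "OCEAN": 2878, "LU": 309,
             "RADIX": 201, "SAMPLE": 450, "N-BODY": 950}

# Ratio of MPI code size to SAS code size for each application.
ratios = {app: mpi_lines[app] / sas_lines[app] for app in mpi_lines}

# Print the applications from the smallest to the largest ratio.
for app, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{app}: {ratio:.2f}")
```

The ratio ranges from about 1.06 for FFT to about 1.91 for RADIX, which is the &amp;quot;almost doubles&amp;quot; case.&lt;br /&gt;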
&lt;br /&gt;
In terms of performance, the results indicated that SAS was only about half as efficient as MP at exploiting parallelism for most of the applications in this study. The only application that showed similar performance for both models (MP and SAS) was LU. The authors concluded that the overhead of the SVM protocol for maintaining page coherence and synchronization was the handicap of the easier SAS programming model.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2004, [http://dl.acm.org/citation.cfm?id=1006252 Iosevich and Schuster] performed a study on the choice of memory consistency model in a DSM. The two types of models under study were the '''sequential consistency''' (SC) model and the relaxed consistency model, in particular the '''home-based lazy release consistency''' (HLRC) protocol. The SC provides a less complex programming model, whereas the HLRC improves on running performance as it allows parallel memory access. Memory consistency models provide a specific set of rules for interfacing with memory.&lt;br /&gt;
&lt;br /&gt;
The authors used a [http://www.cs.uga.edu/~dkl/6730/Fall02/Readings/millipage.pdf multiview] technique to ensure an efficient implementation of SC with fine-grain access to memory. The main advantage of this technique is that, by mapping one physical region to several virtual regions, the system avoids fetching the whole content of the physical region when there is a fault accessing a specific variable located in one of the virtual regions. One step further is the mechanism proposed by [http://onlinelibrary.wiley.com/doi/10.1002/spe.417/abstract Niv and Schuster] to dynamically change the granularity at runtime.&lt;br /&gt;
For SC with multiview (MV), only the page replicas need to be tracked, whereas for HLRC the tracking needed is much more complex. The advantage of HLRC is that write faults on read-only pages are handled locally, so these operations cost less.&lt;br /&gt;
&lt;br /&gt;
This table summarizes the characteristics of the benchmark applications used for the measurements. In the Synch column, B represents barriers and  L locks.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Application&lt;br /&gt;
! Input data set&lt;br /&gt;
! Shared memory&lt;br /&gt;
!Sharing granularity&lt;br /&gt;
! Synch&lt;br /&gt;
!Allocation pattern&lt;br /&gt;
|-&lt;br /&gt;
| Water-nsq|| 8000 molecules|| 5.35MB|| a molecule (672B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Water-sp|| 8000 molecules|| 10.15MB|| a molecule (680B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| LU|| 3072 × 3072|| 72.10MB|| block (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| FFT|| 2^20 numbers|| 48.25MB|| a row segment|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| TSP|| A graph of 32 cities|| 27.86MB|| a tour (276B)|| L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| SOR|| 2066 × 10240|| 80.73MB|| a row (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Barnes-sp|| 32768 bodies|| 41.21MB|| body fields (4-32B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Radix|| 10240000 keys|| 82.73MB|| an integer (4B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Volrend|| a file -head.den-|| 29.34MB|| a 4 × 4 box (4B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Ocean|| a 514 × 514 grid|| 94.75MB|| grid point (8B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| NBody|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|-&lt;br /&gt;
| NBodyW|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The authors found that the average speedup of the HLRC protocol is 5.97, and the average speedup of the SC(with MV) protocol is 4.5 if non-optimized allocation is used, and 6.2 if the granularity is changed dynamically. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2008, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.1319 Roy and Chaudhary] compared the communication requirements of three different page-based DSM systems ([http://www.cs.umd.edu/projects/cvm/ CVM], [http://www.cs.utah.edu/flux/quarks.html Quarks], and [http://ieeexplore.ieee.org/iel4/5737/15339/00709960.pdf?arnumber=709960 Strings]) that use virtual memory mechanisms to trap accesses to shared areas. Their study was also based on the SPLASH-2 suite of programs.&lt;br /&gt;
In their experiments, Quarks was running tasks on separate processes (it does not support more than one application thread per process) while CVM and Strings were run with multiple application threads.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Program &lt;br /&gt;
!CVM &lt;br /&gt;
!Quarks &lt;br /&gt;
!Strings&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| FFT||1290||2419||1894&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-c||135||-||485&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-n||385||2873||407&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| OCEAN-c||1955||15475||6676&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-n2||2253||38438||10032&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-sp||905||7568||1998&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MATMULT||290||1307||645&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SOR||247||7236||934&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The results indicate that Quarks generates a large quantity of messages in comparison with the other two systems. This can be observed in the table above. LU-c (contiguous version of LU) does not have a result for Quarks, as it was not possible to obtain results for the number of tasks (16) used in the experiment. This was due to the excessive number of collisions. In general, due to the lock related traffic, the performance of Quarks is quite low when compared to the other two systems for many of the application programs tested.&lt;br /&gt;
CVM improves on the lock management by allowing out of order access through a centralized lock manager. This makes the locking times for CVM much smaller than for others. &lt;br /&gt;
Nevertheless, the best performer is Strings. It wins in overall performance over the other two systems, but there is room for improvement here too, as locking times in Strings were observed to account for a high percentage of the overall computation time. &lt;br /&gt;
&lt;br /&gt;
The overall conclusion is that using multiple threads per node and optimized locking mechanisms (CVM-type) provides the best performance.&lt;br /&gt;
&lt;br /&gt;
=== Evolution ===&lt;br /&gt;
A more recent version of a distributed shared memory system is vNUMA. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html Chapman and Heiser] describe vNUMA (where v is for virtual and NUMA is for [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non Uniform Memory Access]) as &amp;quot;a virtual machine that presents a cluster as a virtual shared-memory multiprocessor.&amp;quot; The virtualization in vNUMA is a layer between the real CPUs that form the distributed shared memory system and the OS that runs on top of it, usually Linux. The DSM in vNUMA is part of the hypervisor, which is the part of a virtual system that maps guest physical addresses onto real physical ones. Guest virtual addresses are in turn mapped onto guest physical addresses by the guest OS. &lt;br /&gt;
&lt;br /&gt;
The difference from other virtual machines is that vNUMA runs one OS on top of several physical machines, whereas virtual machines often run several guest OSes on top of one host OS that runs on one machine. vNUMA also uses software DSM techniques to present all the virtualized memory as a whole.&lt;br /&gt;
&lt;br /&gt;
The DSM in vNUMA is a single-writer/multiple-reader write-invalidate protocol with sequential consistency. It is based on the IVY DSM, but it introduces several improvements to increase performance. The owner of a page can determine whether the page needs to be sent by looking at the copyset (the set of nodes that hold a copy of the page), which avoids several page faults, and the manager becomes the owner of the copyset as soon as it is part of it. There are two other improvements: ''incremental deterministic merging'' and ''write-update-plus (WU+)''. ''Incremental deterministic merging'' uses sequence numbers to ensure that a location gets updated with the latest value and not with intermediate, out-of-order writes. ''Write-update-plus (WU+)'' enforces single-writer for pages where atomic operations are done. vNUMA dynamically changes from multiple-writer to single-writer when atomic operations are detected. &lt;br /&gt;
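&lt;br /&gt;
The idea of using sequence numbers to discard out-of-order writes can be sketched as follows. This is only an illustration of the principle and not vNUMA's implementation: the class MergedPage, its methods, and the way sequence numbers are supplied by the caller are all assumptions.&lt;br /&gt;

```python
class MergedPage:
    """Per-location merge keyed by sequence number (illustrative only)."""

    def __init__(self):
        # location -> (sequence number, value) of the newest write merged so far
        self.cells = {}

    def apply(self, location, seq, value):
        """Merge a write unless a newer one has already been applied."""
        current = self.cells.get(location, (-1, None))
        # max() on the sequence number keeps whichever write is newer.
        self.cells[location] = max(current, (seq, value),
                                   key=lambda cell: cell[0])

    def value(self, location):
        return self.cells[location][1]


page = MergedPage()
page.apply(16, seq=2, value="new")
page.apply(16, seq=1, value="stale")   # arrives late and is ignored
print(page.value(16))
```

Because the outcome depends only on the sequence numbers and not on arrival order, every node that merges the same set of writes ends up with the same value.&lt;br /&gt;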
&lt;br /&gt;
In [http://communities.vmware.com/community/vmtn/cto/high-performance/blog/2011/09/19/vnuma-what-it-is-and-why-it-matters vNUMA what it is and why it matters], VMware presents vNUMA as part of vSphere, a virtualization platform oriented to build cloud computing frameworks.&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
*[http://www.sgi.com/pdfs/4250.pdf Performance and Productivity Breakthroughs with Very Large Coherent Shared Memory: The SGI® UV Architecture] SGI white paper.&lt;br /&gt;
*[http://www.scalemp.com/architecture#2 Versatile SMP (vSMP) Architecture]&lt;br /&gt;
*Adrian Moga, Michel Dubois, [http://www.sciencedirect.com/science/article/pii/S1383762108001136 &amp;quot;A comparative evaluation of hybrid distributed shared-memory systems,&amp;quot;] Journal of Systems Architecture, Volume 55, Issue 1, January 2009, Pages 43-52&lt;br /&gt;
*Jinbing Peng; Xiang Long; Limin Xiao; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=4708970&amp;amp;isnumber=4708921 &amp;quot;DVMM: A Distributed VMM for Supporting Single System Image on Clusters,&amp;quot;] Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for , vol., no., pp.183-188, 18-21 Nov. 2008&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
*Shan, H.; Singh, J.P.; Oliker, L.; Biswas, R.; , [http://escholarship.org/uc/item/76p9b40g#page-1 &amp;quot;Message passing vs. shared address space on a cluster of SMPs,&amp;quot;] Parallel and Distributed Processing Symposium., Proceedings 15th International , vol., no., pp.8 pp., Apr 2001&lt;br /&gt;
*Protic, J.; Tomasevic, M.; Milutinovic, V.; , [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 &amp;quot;Distributed shared memory: concepts and systems,&amp;quot;] Parallel &amp;amp; Distributed Technology: Systems &amp;amp; Applications, IEEE , vol.4, no.2, pp.63-71, Summer 1996&lt;br /&gt;
*Chandola, V. , [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.5206&amp;amp;rep=rep1&amp;amp;type=pdf &amp;quot;Design Issues in Implementation of Distributed Shared Memory in User Space,&amp;quot;]&lt;br /&gt;
*Nitzberg, B.; Lo, V. , [http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf &amp;quot;Distributed Shared Memory:  A Survey of Issues and Algorithms&amp;quot;]&lt;br /&gt;
*Steven Cameron Woo , Moriyoshi Ohara , Evan Torrie , Jaswinder Pal Singh , Anoop Gupta, [http://dl.acm.org/citation.cfm?id=223990 &amp;quot;The SPLASH-2 programs: characterization and methodological considerations,&amp;quot;] Proceedings of the 22nd annual international symposium on Computer architecture, p.24-36, June 22-24, 1995, S. Margherita Ligure, Italy&lt;br /&gt;
*Jegou, Y. , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Implementation of page management in Mome, a user-level DSM] Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on, p.479-486, 21 May 2003, IRISA/INRIA, France&lt;br /&gt;
*Hennessy, J.; Heinrich, M.; Gupta, A.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=747863 &amp;quot;Cache-Coherent Distributed Shared Memory:  Perspectives on Its Development and Future Challenges,&amp;quot;] Proceedings of the IEEE, Volume: 87 Issue:3, pp.418 - 429, Mar 1999, Comput. Syst. Lab., Stanford Univ., CA&lt;br /&gt;
*J. Protic, M. Tomasevic, and V. Milutinovic, [http://media.wiley.com/product_data/excerpt/76/08186773/0818677376-2.pdf An Overview of Distributed Shared Memory]&lt;br /&gt;
*Yoon, M.; Malek, M.; , [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf &amp;quot;Configurable Shared Virtual Memory for Parallel Computing&amp;quot;] University of Texas Technical Report tr94-21, July 15 1994, Department of Electrical and Computer Engineering, The University of Texas at Austin&lt;br /&gt;
*Dubnicki, C.;   Iftode, L.;   Felten, E.W.;   Kai Li;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=508084 &amp;quot;Software Support for Virtual Memory-Mapped Communication&amp;quot;] Parallel Processing Symposium, 1996., Proceedings of IPPS '96, The 10th International, pp.372 - 381, 15-19 Apr 1996, Dept. of Comput. Sci., Princeton Univ., NJ &lt;br /&gt;
*Dubnicki, C.;   Bilas, A.;   Li, K.;   Philbin, J.;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=580931 &amp;quot;Design and Implementation of Virtual Memory-Mapped Communication on Myrinet&amp;quot;] Parallel Processing Symposium, 1997. Proceedings., 11th International, pp.388 - 396, 1-5 Apr 1997, Princeton Univ., NJ&lt;br /&gt;
*Kranz, D.; Johnson, K.; Agarwal, A.; Kubiatowicz, J.; Lim, B.; , [http://delivery.acm.org.prox.lib.ncsu.edu/10.1145/160000/155338/p54-kranz.pdf?ip=152.1.24.251&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=81831968&amp;amp;CFTOKEN=62928147&amp;amp;__acm__=1327853249_029db7e958cb50bd47056f939f3296f7 &amp;quot;Integrating Message-Passing and Shared-Memory:  Early Experience&amp;quot;] PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp.54 - 63&lt;br /&gt;
*Amza, C.;  Cox, A.L.;  Dwarkadas, S.;  Keleher, P.;  Honghui Lu;  Rajamony, R.;  Weimin Yu;  Zwaenepoel, W.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 &amp;quot;TreadMarks: shared memory computing on networks of workstations&amp;quot;] Computer, Volume: 29 Issue:2, pp. 18 - 28, Feb 1996, Dept. of Comput. Sci., Rice Univ., Houston, TX&lt;br /&gt;
*David R. Cheriton. 1985. [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems.] SIGOPS Oper. Syst. Rev. 19, 4 (October 1985), 26-33.&lt;br /&gt;
*Matthew Chapman and Gernot Heiser. 2009. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html vNUMA: a virtual shared-memory multiprocessor.] In Proceedings of the 2009 conference on USENIX Annual technical conference (USENIX'09). USENIX Association, Berkeley, CA, USA, 2-2.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
The memory hierarchy described for the CSVS system places remote memories:&lt;br /&gt;
# Between main memory and local disk storage&lt;br /&gt;
# Same hierarchy as local disk storage&lt;br /&gt;
# Below local disk storage&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When messages are sent by the OS to retrieve non-local data the virtual address of the retrieved data is translated to physical:&lt;br /&gt;
# At the origin of the message, i.e. where the page fault occurs&lt;br /&gt;
# By the DSM system default manager&lt;br /&gt;
# At the location where the desired page resides&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
DSM nodes&lt;br /&gt;
# partially map variable amounts of their memory to the distributed address space&lt;br /&gt;
# are configured to supply a contiguous and fixed amount of memory to the distributed address space&lt;br /&gt;
# utilize I/O to access the entirely non-local distributed address space&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SAS programming model:&lt;br /&gt;
# Has evolved beyond MP as it is difficult to program in scalable DSM environments&lt;br /&gt;
# Utilizes MP to communicate but relies on the ease of a common address space&lt;br /&gt;
# Has suffered too many security problems, scalable MP now dominates the landscape&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Page management in MOME:&lt;br /&gt;
# Requires consistent address space mapping across all nodes&lt;br /&gt;
# Is managed from a global DSM perspective&lt;br /&gt;
# Allows an F and V page descriptor to occur for the same page on the same node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The most adopted DSM algorithm is:&lt;br /&gt;
# Single Reader/ Single Writer (SRSW)&lt;br /&gt;
# Multiple Readers/ Single Writer (MRSW)&lt;br /&gt;
# Multiple Readers/Multiple Writers (MRMW)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Sequential Consistency:&lt;br /&gt;
# all processors see the same ordering of memory references, and these are issued in sequences by the individual processors&lt;br /&gt;
# the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence&lt;br /&gt;
# consistency is required only on synchronization accesses&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
SPLASH is a&lt;br /&gt;
# coherence protocol&lt;br /&gt;
# collection of parallel programs engineered for the evaluation of shared address space machines&lt;br /&gt;
# DSM implementation&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The complexity of development is &lt;br /&gt;
# the same for MP and SAS&lt;br /&gt;
# lower for SAS&lt;br /&gt;
# lower for MP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
vNUMA is a&lt;br /&gt;
# Fast Fourier Transform implementation&lt;br /&gt;
# network implementation&lt;br /&gt;
# virtual machine that presents a cluster as a virtual shared-memory multiprocessor&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72425</id>
		<title>CSC/ECE 506 Spring 2013/4a ss</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72425"/>
		<updated>2013-02-10T21:45:29Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Dsm.jpg|300px|thumb|left|SCD's IBM SP system blackforest, a distributed shared memory ('''DSM''') system]]&lt;br /&gt;
&lt;br /&gt;
== SAS programming on distributed-memory machines ==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Shared_memory '''Shared Address Space'''] (SAS) programming on distributed memory machines is a programming abstraction that requires less development effort than the traditional method of [http://en.wikipedia.org/wiki/Message_passing '''Message Passing'''] (MP) on distributed memory machines, such as clusters of servers.  Distributed systems are groups of computers that communicate through a network and share a common work goal.  Distributed systems typically do not physically share the same memory (are not [http://en.wikipedia.org/wiki/Coupling_%28computer_programming%29 '''tightly coupled''']) but rather each processor or group of processors must depend on mechanisms other than direct memory access in order to communicate.  Relevant issues that come to bear include [http://en.wikipedia.org/wiki/Memory_coherence '''memory coherence'''], types of memory access, data and process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''], and performance.&lt;br /&gt;
&lt;br /&gt;
=== Origins ===&lt;br /&gt;
Distributed memory systems started to flourish in the 1980s. The increasing performance of processors and network connectivity offered the perfect environment for parallel processing over a network of computers. This was a cheap way to put together massive computing power. The main drawback was the move from sequential programs written for local memory to parallel programming in shared memory. This is where SAS provided the means to simplify programming by hiding the mechanisms to access distant memory located in other computers of the cluster.&lt;br /&gt;
&lt;br /&gt;
In 1985, Cheriton, in his article [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 &amp;quot;Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems&amp;quot;], introduced ideas for the application of shared memory techniques in distributed memory systems. Cheriton envisioned a system of nodes with a pool of shared memory with a common file namespace that could &amp;quot;decentralize the implementation of a service.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:Dsmd.jpg|400px|thumb|right|Distributed Shared Memory]]&lt;br /&gt;
Early distributed computer systems relied almost exclusively on message passing in order to communicate with one another, and this technique is still widely used today.  In a message passing model, each processor's local memory can be considered as isolated from that of the rest of the system.  Processes or objects can send or receive messages in order to communicate, and this can occur in a synchronous or asynchronous manner.  In distributed systems, and particularly with certain types of programs, the message passing model can become overly burdensome to the programmer, as tracking data movement and maintaining data integrity can become quite challenging with many control threads.  A shared address or shared-memory system, however, can provide a programming model that simplifies data sharing via uniform mechanisms of data structure reads and writes on common memory.  Current distributed systems seek to take advantage of both SAS and MP programming model principles in hybrid systems. &lt;br /&gt;
&lt;br /&gt;
=== Distributed Shared Memory (DSM) ===&lt;br /&gt;
Generally a distributed system consists of a set of nodes connected by an interconnection network.  Nodes may be comprised of individual processors or a multiprocessor system (e.g. [http://en.wikipedia.org/wiki/Symmetric_multiprocessing '''Symmetric Multiprocessor'''] (SMP)), the latter typically sharing a system bus.  Each node itself contains a local memory, which maps partially to the distributed address space.  &lt;br /&gt;
Regardless of the system topology (bus, ring, mesh), a specific interconnection in each cluster must connect it to the system. Information about states and current locations of particular data blocks usually resides in a system table or directory. Directory organization varies from full map storage to different dynamic organizations.&lt;br /&gt;
Three issues arise when accessing data in the DSM address space while keeping the data consistent:&lt;br /&gt;
* '''DSM algorithm''' - which algorithm to use to access data. Commonly used strategies are replication and migration. Replication allows multiple copies of the same data item to reside in different local memories. Migration implies that only a single copy of a data item exists at any one time, so the data item must be moved to the requesting site for exclusive use.&lt;br /&gt;
* '''Implementation level of the DSM''' - a lookup for data should first determine whether the requested data is in local memory; if not, the system must bring the data into local memory. This can be done in software, in hardware, or in both. The choice of implementation depends on price/performance trade-offs.&lt;br /&gt;
* '''Memory consistency model''' - the behavior of the memory with respect to read and write operations from multiple processors has to be governed by an appropriate memory consistency model.&lt;br /&gt;
&lt;br /&gt;
=== Cache-Coherent DSM ===&lt;br /&gt;
&lt;br /&gt;
Early DSM systems implemented a shared address space where the amount of time required to access a piece of data was related to its location.  These systems became known as [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access '''Non-Uniform Memory Access'''] (NUMA), whereas an SMP type system is known as [http://en.wikipedia.org/wiki/Uniform_Memory_Access '''Uniform Memory Access'''] (UMA) architecture.  NUMA architectures were difficult to program due to potentially significant differences in data access times. SMP architectures dealt with this problem through caching. Protocols were established that ensured that, prior to writing a location, all other copies of the location (in other caches) were invalidated.  These protocols did not scale to DSM machines and different approaches were necessary.&lt;br /&gt;
&lt;br /&gt;
Cache-coherent DSM architectures rely on directory-based [http://en.wikipedia.org/wiki/Cache_coherency '''cache coherence'''], where an extra directory structure keeps track of all blocks that have been cached by each processor.  A coherence protocol can then establish a consistent view of memory by maintaining state and other information about each cached block.  At a minimum, these states usually include Invalid, Shared, and Exclusive.  Furthermore, in a cache-coherent DSM machine, the directory is distributed across the nodes, so that each directory entry resides with the physical local memory that holds the block it describes.&lt;br /&gt;
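&lt;br /&gt;
A directory entry with the three minimal states can be sketched as below. The class and method names are hypothetical, and the sketch is simplified: it does not model messages, owner write-backs, or the distribution of entries across nodes.&lt;br /&gt;

```python
from enum import Enum


class State(Enum):
    INVALID = "Invalid"
    SHARED = "Shared"
    EXCLUSIVE = "Exclusive"


class DirectoryEntry:
    """Directory record for one memory block (hypothetical layout)."""

    def __init__(self):
        self.state = State.INVALID
        self.sharers = set()   # processors that currently cache the block

    def read(self, proc):
        # A reader joins the sharer set; an exclusive owner is downgraded.
        self.sharers.add(proc)
        self.state = State.SHARED

    def write(self, proc):
        # Every other cached copy is invalidated before the write proceeds.
        invalidated = self.sharers - {proc}
        self.sharers = {proc}
        self.state = State.EXCLUSIVE
        return invalidated


entry = DirectoryEntry()
entry.read(0)
entry.read(1)
print(entry.state, sorted(entry.sharers))
print(sorted(entry.write(2)))   # processors whose copies were invalidated
print(entry.state, sorted(entry.sharers))
```

Keeping the sharer set in the directory lets a write send invalidations only to the processors that actually cache the block, instead of broadcasting to all of them.&lt;br /&gt;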
&lt;br /&gt;
=== Page Management and memory mapping in Mome ===&lt;br /&gt;
[[File:Untitled_Project.jpg|350px|thumb|left|Memory Mapping in Mome]]&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Mome] is described by its developers as a user-level distributed shared memory.  Mome, in 2003, was a run-time model that mapped Mome segments onto node private address space.  &lt;br /&gt;
&lt;br /&gt;
===== Mome Segment creation =====&lt;br /&gt;
&lt;br /&gt;
Segment creation was initiated through a ''MomeCreateSegment(size)'' call, which returned an identifier for mapping used by all nodes.  Any process can request a mapping of a section of its local memory to a Mome segment section by calling ''MomeMap(Addr, Lg, Prot, Flags, Seg, Offset)'', which returns the starting address of the mapped region.  Each mapping request made by a process is independent and the addresses of the mappings may or may not be consistent on all nodes.  If mappings are consistent between processes, however, then pointers may be shared by them.  Mome supports strong and weak consistency models, and for any particular page each node is able to dynamically manage its consistency during program execution.&lt;br /&gt;
&lt;br /&gt;
===== Page Management in Mome =====&lt;br /&gt;
&lt;br /&gt;
Mome manages [http://en.wikipedia.org/wiki/Page_%28computer_memory%29 '''pages'''] in a directory-based scheme where each page directory maintains the status of six characteristics per page on each node.  The page manager acts upon collections of nodes according to these characteristics for each page:  &lt;br /&gt;
V nodes possess the current version, M nodes have a modified version, S nodes want strong consistency, I nodes are &lt;br /&gt;
invalidated, F nodes have initiated a modification merge and H nodes are a special type of hidden page.  A new version of &lt;br /&gt;
a page is created prior to a constraint violation and before modifications are integrated as a result of a consistency&lt;br /&gt;
request.  &lt;br /&gt;
&lt;br /&gt;
===== Memory mapping in Mome =====&lt;br /&gt;
&lt;br /&gt;
The Mome memory mapping figure to the left shows a possible DSM memory organization on a single node.  The DSM memory size&lt;br /&gt;
shown is 22 pages.  When a new segment is created on a node a segment descriptor is created on that node.  In this case the&lt;br /&gt;
segment descriptor is 12 pages, with each segment descriptor block corresponding to one page.  Each block also contains&lt;br /&gt;
three DSM memory references for current, modified and next version of pages.  The memory organization state shows an &lt;br /&gt;
application with two mappings, M1 and M2, with segment offsets at 0 and 8.  The six pages of M1 are managed by segment &lt;br /&gt;
descriptor blocks 0 to 5.  The descriptor blocks (and application memory) show that pages 1, 2, and 5 have no associated &lt;br /&gt;
memory, while M1 page 0 is mapped to block 6 as a current version and M1 page 3 is mapped to block 13 as a current version,&lt;br /&gt;
block 8 as a modified version, and has initiated a modification merge as indicated by the block 17 pointer.  The communication&lt;br /&gt;
layer manages incoming messages from other nodes. &lt;br /&gt;
&lt;br /&gt;
[[File:Mem hierarchy.png|200px|thumb|right|Memory hierarchy of node]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Node Communication ===&lt;br /&gt;
As described by [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf Yoon, et al.] in 1994, the communication paradigm &lt;br /&gt;
for their DSM node network relies on a memory hierarchy for each node that places remote memories at the same level of the hierarchy as &lt;br /&gt;
its own local disk storage.  [http://en.wikipedia.org/wiki/Page_fault '''Page faults'''] within a given node that can be resolved within disk storage are handled &lt;br /&gt;
normally while those that cannot are resolved between node main memory and memory of other nodes. Point-to-point &lt;br /&gt;
communication at the node level is supported through message passing, and the specific mechanism for communication is&lt;br /&gt;
agreed to by all nodes. &lt;br /&gt;
&lt;br /&gt;
Yoon describes a DSM system that generates a shared virtual memory on a per-job basis.  A '''configurable shared virtual address space'''&lt;br /&gt;
(CSVS) is prepared when a member node receives a job: the node generates a job identification number and creates&lt;br /&gt;
an information table in its memory:&lt;br /&gt;
&lt;br /&gt;
                                     ''JOB_INFORMATION {''&lt;br /&gt;
                                         ''status;''&lt;br /&gt;
                                         ''number_of_tasks;''&lt;br /&gt;
                                         ''number_of_completed_tasks;''&lt;br /&gt;
                                         ''*member_list;''                    /*pointer to first member*/&lt;br /&gt;
                                         ''number_of_members;''&lt;br /&gt;
                                         ''IO_server;''&lt;br /&gt;
                                     ''}''&lt;br /&gt;
&lt;br /&gt;
The ''status'' refers to the creation of the CSVS, and ''number_of_members'' and ''member_list'' are established through&lt;br /&gt;
a task distribution process during address space assignment.  All tasks associated with the program are tagged with&lt;br /&gt;
the ''job_id'' and ''requester_id'' and, following address space assignment, are distributed across the system.  The&lt;br /&gt;
actual CSVS creation occurs when the first task of a job is initiated by a member, which asks all other members to generate the new&lt;br /&gt;
CSVS.  Subspace assignment for the SAS model ensues under the specific ''job_id''.&lt;br /&gt;
&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Operating_system '''operating system'''] (OS) or [http://en.wikipedia.org/wiki/Memory_management_unit '''memory management unit'''] (MMU) of each member maintains a copy of the ''JOB_INFORMATION'' &lt;br /&gt;
table which is consulted to identify the default manager when a page fault occurs.  When a page fault does occur, the MMU&lt;br /&gt;
locates the default manager and handles the fault normally.  If the page requested is out of its subspace then the &lt;br /&gt;
virtual address, ''job_id'', and default manager identification are sent to the [http://en.wikipedia.org/wiki/Control_unit '''control unit'''] (CU) to construct a &lt;br /&gt;
message requesting a page copy.  All messages sent through the CSVS must include a virtual address and the ''job_id'',&lt;br /&gt;
which acts as protection to control access to relevant memory locations.  When received at the appropriate member&lt;br /&gt;
node, the virtual address is translated to a local physical address.&lt;br /&gt;
[[File:Jacobi_code.jpg|300px|thumb|left|Jacobi method pseudocode using TreadMarks API]]&lt;br /&gt;
&lt;br /&gt;
=====Improvements in communication=====&lt;br /&gt;
Early SAS programming models in DSM environments suffered from poor performance because protection schemes required&lt;br /&gt;
applications to access the network via system calls, significantly increasing latency.  Later software&lt;br /&gt;
systems and network interfaces arose that were able to ensure safety without incurring the time cost of the system calls.  Addressing this and other &lt;br /&gt;
latency sources on both ends of communication was an important goal for projects such as the '''Virtual memory-mapped'''&lt;br /&gt;
'''communication''' (VMMC) model that was developed as part of the [http://shrimp.cs.princeton.edu/index.html Shrimp Project]. &lt;br /&gt;
&lt;br /&gt;
Protection is achieved in VMMC because the receiver must grant permission before the sender is allowed to transfer data&lt;br /&gt;
to a receiver defined area of its address space.  In this communication scheme, the receiver process exports areas of its&lt;br /&gt;
address space that will act as receive buffers and sending processes must import the destinations.  There is no explicit &lt;br /&gt;
receive operation in VMMC.  Receivers are able&lt;br /&gt;
to define which senders can import specific buffers, and VMMC ensures that only receive buffer space is overwritten.  Imported&lt;br /&gt;
receive buffers are mapped to a destination proxy space, which can be implemented as part of the sender's virtual address&lt;br /&gt;
space and can be translated by VMMC to a receiver node, process, and memory address.  VMMC supports a deliberate update&lt;br /&gt;
request, an explicit request to transfer data from the sender's memory to a previously imported receive buffer.  This transfer occurs directly without receiver&lt;br /&gt;
CPU interruption.&lt;br /&gt;
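&lt;br /&gt;
The export/import handshake described above can be modelled as follows.  This is a toy Python sketch under stated assumptions: the ''Receiver'' and ''Sender'' classes, their method names and the buffer name are all hypothetical, and a shared ''bytearray'' stands in for the network transfer; it is not the VMMC interface itself.&lt;br /&gt;
&lt;br /&gt;
```python
# Toy model of receiver-granted permission (hypothetical names, not the VMMC API).
class Receiver:
    def __init__(self, name):
        self.name = name
        self.buffers = {}   # buffer name maps to its memory
        self.allowed = {}   # buffer name maps to the senders allowed to import it

    def export(self, buf_name, size, senders):
        # The receiver decides which region is a receive buffer and who may use it.
        self.buffers[buf_name] = bytearray(size)
        self.allowed[buf_name] = set(senders)


class Sender:
    def __init__(self, name):
        self.name = name
        self.proxies = {}   # imported destinations, keyed by (receiver, buffer)

    def import_buffer(self, receiver, buf_name):
        if self.name not in receiver.allowed.get(buf_name, set()):
            raise PermissionError('import not granted')
        self.proxies[(receiver.name, buf_name)] = receiver.buffers[buf_name]

    def send(self, receiver_name, buf_name, offset, data):
        target = self.proxies[(receiver_name, buf_name)]
        end = offset + len(data)
        # Reject transfers that would fall outside the exported buffer.
        if offset != max(offset, 0) or end != min(end, len(target)):
            raise ValueError('transfer would leave the receive buffer')
        target[offset:end] = data


receiver = Receiver('node1')
receiver.export('inbox', 8, senders=['node0'])
sender = Sender('node0')
sender.import_buffer(receiver, 'inbox')
sender.send('node1', 'inbox', 2, b'hi')   # no receive call on the other side
```
&lt;br /&gt;
A sender that was not named in the export (for example a hypothetical ''node2'') fails at ''import_buffer'', which mirrors the rule that the receiver must grant permission before any data is transferred.&lt;br /&gt;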
&lt;br /&gt;
[[File:Shortest_path_pseudocode.jpg|300px|thumb|right|Shortest path pseudocode using TreadMarks API]]&lt;br /&gt;
=== Programming Environment ===&lt;br /&gt;
The globally shared memory abstraction provided through virtual memory or some other DSM mechanism allows programmers &lt;br /&gt;
to focus on algorithms instead of processor communication and data tracking.  Many programming environments have been&lt;br /&gt;
developed for DSM systems including Rice University's [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]&lt;br /&gt;
in the 1990s.  TreadMarks was a user-level library that ran on top of Unix.  Programs were written in&lt;br /&gt;
C, C++ or Fortran and then compiled and linked with the TreadMarks library.  &lt;br /&gt;
&lt;br /&gt;
Shown at left is a pseudocode example of using the TreadMarks API to implement the Jacobi method, a type of partial &lt;br /&gt;
differential equation solver.  The code iterates over a 2D array and updates each element to the average of its four&lt;br /&gt;
nearest neighbors.  All processors are assigned an approximately equivalent number of rows and neighboring processes &lt;br /&gt;
share boundary rows as is necessary for the calculation.  This example shows TreadMarks' use of [http://en.wikipedia.org/wiki/Barrier_%28computer_science%29 '''barriers'''], a technique used for process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''].  Barriers prevent race&lt;br /&gt;
conditions.  ''void Tmk_startup(int argc, char **argv)'' initializes TreadMarks and starts the remote processes.  &lt;br /&gt;
The ''void Tmk_barrier(unsigned id)'' call blocks the calling process until every other process arrives at the barrier.  In this&lt;br /&gt;
example, ''Tmk_barrier(0)'' guarantees that process 0 completes initialization before any process proceeds, ''Tmk_barrier(1)'' &lt;br /&gt;
guarantees all previous iteration values are read before any current iteration values are written, and ''Tmk_barrier(2)''&lt;br /&gt;
guarantees all current iteration values are written before any next iteration computation begins.&lt;br /&gt;
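&lt;br /&gt;
The two per-iteration barriers can be illustrated with Python threads, with ''threading.Barrier'' standing in for ''Tmk_barrier''.  This is an analogy inside a single process, not TreadMarks code; the grid size, worker count and iteration count are arbitrary assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
import threading

N_WORKERS = 2
ROWS, COLS, ITERATIONS = 6, 6, 3

# Shared grid: boundary cells stay fixed at 1.0, interior cells start at 0.0.
grid = [[1.0 if r in (0, ROWS - 1) or c in (0, COLS - 1) else 0.0
         for c in range(COLS)] for r in range(ROWS)]
scratch = [row[:] for row in grid]
barrier = threading.Barrier(N_WORKERS)


def worker(rank):
    # Each worker owns a contiguous band of interior rows.
    interior = list(range(1, ROWS - 1))
    chunk = len(interior) // N_WORKERS
    band = interior[rank * chunk:(rank + 1) * chunk]
    for _ in range(ITERATIONS):
        for r in band:
            for c in range(1, COLS - 1):
                scratch[r][c] = (grid[r - 1][c] + grid[r + 1][c] +
                                 grid[r][c - 1] + grid[r][c + 1]) / 4.0
        # Every worker finishes reading old values before anyone overwrites them.
        barrier.wait()
        for r in band:
            for c in range(1, COLS - 1):
                grid[r][c] = scratch[r][c]
        # Every worker finishes writing before the next iteration starts reading.
        barrier.wait()


threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
&lt;br /&gt;
Removing either ''barrier.wait()'' call would let one worker read a neighbouring boundary row while the other is still updating it, which is the race the barriers are there to prevent.&lt;br /&gt;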
&lt;br /&gt;
Shown at right is a short pseudocode program illustrating another SAS synchronization technique, [http://en.wikipedia.org/wiki/Lock_%28computer_science%29 '''locks'''].  This program calculates the shortest route through a group of nodes that starts at a designated start node, visits every&lt;br /&gt;
other node once, and returns to the origin node.  The shortest route identified thus far is stored in the shared ''Shortest_length'',&lt;br /&gt;
and investigated routes are kept in a queue, most promising at the front, and expanded one node at a time.  A process&lt;br /&gt;
compares its resulting shortest partial path with ''Shortest_length'', updates it if necessary, and returns to the queue&lt;br /&gt;
to continue its search.  Process 0 allocates the shared queue and minimum length.  Exclusive access must be established&lt;br /&gt;
and maintained to ensure correctness, and this is achieved through a lock on the queue and on ''Shortest_length''.  Each&lt;br /&gt;
process acquires the queue lock to identify a promising partial path and releases it upon finding one.  When &lt;br /&gt;
updating ''Shortest_length'', a lock is acquired to ensure [http://en.wikipedia.org/wiki/Mutual_exclusion '''mutual exclusion'''] while this shared data is modified as well.&lt;br /&gt;
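&lt;br /&gt;
The two-lock pattern described above can be sketched with Python threads.  This is a simplified illustration, not the TreadMarks program in the figure: the queue holds made-up candidate route lengths instead of partial paths, and the variable names are assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
import threading
from collections import deque

# Hypothetical stand-in for the shared work queue: each entry is a candidate route length.
queue = deque([42, 17, 33, 25, 19, 58, 21])
queue_lock = threading.Lock()
length_lock = threading.Lock()
shortest_length = [float('inf')]   # one-element list so workers can update it in place


def worker():
    while True:
        with queue_lock:           # exclusive access while taking work from the queue
            if not queue:
                return
            candidate = queue.popleft()
        with length_lock:          # mutual exclusion while updating the shared minimum
            shortest_length[0] = min(shortest_length[0], candidate)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
&lt;br /&gt;
Using separate locks for the queue and for the minimum keeps a worker that is updating the shared length from blocking others that only need to fetch more work.&lt;br /&gt;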
&lt;br /&gt;
=== DSM Implementations ===&lt;br /&gt;
From an architectural point of view, DSMs are composed of several nodes connected via a network. Each of the nodes can be an individual machine or a cluster of machines. Each node has local memory modules that belong, partly or entirely, to the shared memory. DSM implementations can be classified by many characteristics. The most obvious one is the nature of the implementation: software, hardware, or hybrid. This historical classification has been extracted from [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 Distributed shared memory: concepts and systems].&lt;br /&gt;
&lt;br /&gt;
Software DSM implementations are those implemented in user-level software, the OS, a programming language, or a combination of these. &lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Implementation'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Implementation / Cluster configuration'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Network'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Algorithm'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Consistency Model'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Granularity Unit'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Coherence Policy'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''SW/HW/Hybrid'''&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/vm-and-gc/ivy-shared-virtual-memory-li-icpp-1988.pdf IVY]||User-level library + OS modification || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||1 Kbyte ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/121133.121159 Munin]||Runtime system + linker + library + preprocessor + OS modifications ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||Type-specific (SRSW, MRSW, MRMW) ||Release ||Variable size objects ||Type-specific (delayed update, invalidate) ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]||User-level || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRMW ||Lazy release ||4 Kbytes ||Update, Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/74851.74871 Mirage]||OS kernel ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||MRSW ||Sequential ||512 bytes ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://onlinelibrary.wiley.com.prox.lib.ncsu.edu/doi/10.1002/spe.4380210503/pdf Clouds]||OS, out of kernel || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Inconsistent, sequential ||8 Kbytes ||Discard segment when unlocked ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1663305&amp;amp;tag=1 Linda]||Language || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||Variable (tuple size) ||Implementation-dependent ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=93183 Memnet]||Single processor, Memnet device||Token ring||MRSW||Sequential||32 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Scalable_Coherent_Interface SCI]||Arbitrary||Arbitrary||MRSW||Sequential||16 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.8026&amp;amp;rep=rep1&amp;amp;type=pdf KSR1]||64-bit custom PE, I+D caches, 32M local memory||Ring-based||MRSW||Sequential||128 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=766965&amp;amp;isnumber=16621 RMS]||1-4 processors, caches, 256M local memory||RM bus||MRMW||Processor||4 bytes||Update||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Alewife_(multiprocessor) Alewife]||Sparcle PE, 64K cache, 4M local memory, CMMU|| mesh||MRSW||Sequential||16 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://dl.acm.org/citation.cfm?id=192056 Flash]||MIPS T5, I+D caches, Magic controller|| mesh||MRSW||Release||128 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s08/papers/10_tempest.pdf Typhoon]||SuperSparc, 2-L caches|| NP controller||MRSW||Custom||32 bytes||Invalidate custom||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://shrimp.cs.princeton.edu/ Shrimp]||16 Pentium PC nodes|| Intel Paragon routing network||MRMW||AURC, scope||4 Kbytes||Update/Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Below is an explanation of the main characteristics listed in the DSM classification.&lt;br /&gt;
 &lt;br /&gt;
There are three types of DSM algorithm: &lt;br /&gt;
* '''Single Reader/ Single Writer''' (SRSW) &lt;br /&gt;
** central server algorithm - produces long network delays &lt;br /&gt;
** migration algorithm - produces thrashing and false sharing&lt;br /&gt;
* '''Multiple Readers/ Single Writer''' (MRSW) - read replication algorithm. It uses write invalidate. MRSW is the most adopted algorithm.&lt;br /&gt;
* '''Multiple Readers/Multiple Writers''' (MRMW) - full replication algorithm.  It has full concurrency and uses atomic updates.&lt;br /&gt;
&lt;br /&gt;
The consistency model plays a fundamental role in DSM systems. Because of the nature of distributed systems, each consistency model constrains memory accesses differently. &amp;quot;A memory consistency model defines the legal ordering of memory references issued by some processor, as observed by other processors.&amp;quot; The stricter the consistency model, the higher the access times, but the simpler the programming. Some types of consistency model are:&lt;br /&gt;
* Sequential consistency - all processors see the same ordering of memory references, and the references of each individual processor appear in the order in which that processor issued them.&lt;br /&gt;
* Processor consistency - the writes issued by any one processor are observed by every other processor in the order in which they were issued, but writes from different processors may be observed in different orders. &lt;br /&gt;
* Weak consistency - consistency is required only on synchronization accesses.&lt;br /&gt;
* Release consistency - divides the synchronization accesses into acquire and release. Normal reads and writes for a particular node can be done only after all acquires for that node are finished. Similarly, releases can only be done after all writes and reads are finished. &lt;br /&gt;
* Lazy release consistency - extends release consistency by propagating modifications to the shared data only on an acquire, and of those, only the ones related to critical sections.&lt;br /&gt;
* Entry consistency - synchronization is performed at variable level. This increases programming labor but helps with lowering latency and traffic exchanges as only the specific variable needs to be synchronized.&lt;br /&gt;
&lt;br /&gt;
Granularity refers to the size of the data blocks managed by the coherence protocol. The unit differs between hardware and software systems: hardware systems tend to use smaller blocks than the virtual memory layer that manages the data in software systems. The problem with larger blocks is that the probability of contention is higher, even when the processors involved are not accessing the same piece of memory but only different parts of the same block. This is known as false sharing, and it causes thrashing (memory blocks keep being requested by processors, and processors keep waiting for the same memory blocks).&lt;br /&gt;
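&lt;br /&gt;
A small worked example makes the effect of block size concrete.  The sketch below is illustrative only: the two addresses are made up, and the 4096-byte and 32-byte block sizes are assumptions chosen to represent a page-based software DSM and a small hardware block.&lt;br /&gt;
&lt;br /&gt;
```python
def block_of(address, block_size):
    # Index of the coherence block that contains a byte address.
    return address // block_size


def falsely_shared(addr_a, addr_b, block_size):
    # Two distinct addresses conflict when they land in the same block.
    return addr_a != addr_b and block_of(addr_a, block_size) == block_of(addr_b, block_size)


PAGE = 4096   # page-sized unit, as in page-based software DSMs
LINE = 32     # small block, as in hardware DSMs

x_addr, y_addr = 1000, 3000   # hypothetical addresses written by two different processors
same_page = falsely_shared(x_addr, y_addr, PAGE)
same_line = falsely_shared(x_addr, y_addr, LINE)
```
&lt;br /&gt;
With these values the two variables share one 4096-byte page but fall in different 32-byte blocks, so only the coarse-grain system would bounce the block between the two writers.&lt;br /&gt;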
&lt;br /&gt;
Coherence policy regulates data replication. It dictates whether data being written at one site should be invalidated or updated at the remote sites. Usually, systems with fine-grain coherence (byte/word) impose the update policy, whereas systems based on coarse-grain (page) coherence use the invalidate policy. Elsewhere in the literature this is called the coherence protocol, and the two types are known as write-invalidate and write-update. The write-invalidate protocol invalidates all copies except one before writing to it. In contrast, write-update keeps all copies up to date.&lt;br /&gt;
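&lt;br /&gt;
The contrast between the two policies can be shown with a toy model in which a dictionary maps each node to its copy of one data item.  The function names and the three-node setup are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def write_invalidate(copies, writer, value):
    # Every copy other than the writer's is discarded before the write.
    return {writer: value}


def write_update(copies, writer, value):
    # Every existing copy is kept and receives the new value.
    updated = {node: value for node in copies}
    updated[writer] = value
    return updated


copies = {'A': 1, 'B': 1, 'C': 1}          # three nodes hold the same value
after_invalidate = write_invalidate(copies, 'A', 2)
after_update = write_update(copies, 'A', 2)
```
&lt;br /&gt;
After an invalidating write only the writer holds the data, so later readers must fetch it again; after an updating write every node still holds a valid copy, at the cost of sending the new value to all of them.&lt;br /&gt;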
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
There are numerous studies of the performance of shared memory applications in distributed systems. The vast majority of them use a collection of programs named [http://dl.acm.org/citation.cfm?id=223990 SPLASH and SPLASH-2].&lt;br /&gt;
===== SPLASH and SPLASH-2 =====&lt;br /&gt;
The '''Stanford ParalleL Applications for SHared memory''' (SPLASH) is a collection of parallel programs engineered for the evaluation of shared address space machines. Research studies have used these programs to measure and analyze different aspects of the DSM architectures that were emerging at the time. A subsequent suite of programs (SPLASH-2) grew out of the need to overcome the limitations of the SPLASH programs. SPLASH-2 covers a broader range of scientific programs, makes use of improved algorithms, and pays more attention to the architecture of the underlying systems.&lt;br /&gt;
&lt;br /&gt;
Selected applications in the SPLASH-2 collections include:&lt;br /&gt;
*FFT: a '''Fast Fourier Transform''' implementation, in which the data is organized in source and destination matrices so that each processor stores a contiguous set of rows in its local memory. In this application all processors communicate with one another, exchanging data to perform a matrix transposition.&lt;br /&gt;
*Ocean: simulates large-scale ocean movements based on eddy currents. Its computation shows nearest-neighbor access patterns on a multi-grid formation rather than a single grid.&lt;br /&gt;
*LU: decomposition of a matrix into the product of a lower triangular and an upper triangular matrix. LU exhibits &amp;quot;one-to-many non-personalized communication&amp;quot;.&lt;br /&gt;
*Barnes: simulates the interaction of a group of particles over time steps. &lt;br /&gt;
*Radix sort: integer sorting algorithm. This implementation involves communication among all the processors, with irregular communication patterns.&lt;br /&gt;
&lt;br /&gt;
===== Case studies =====&lt;br /&gt;
In 2001, [http://escholarship.org/uc/item/76p9b40g#page-1 Shan et al.] presented a comparison of the performance and programming effort of message passing (MP) versus SAS running on clusters of '''Symmetric Multiprocessors''' (SMPs). They highlighted the &amp;quot;automatic management and coherent replication&amp;quot; of the SAS programming model, which facilitates programming on these types of clusters. The study uses the MPI/Pro protocol for the MP programming model and the GeNIMA SVM protocol (a page-based shared virtual memory protocol) for SAS on a 32-processor system (a cluster of 8 machines, each a 4-way SMP). The subset of applications used includes regularly structured applications such as FFT, Ocean, and LU, contrasted with irregular ones such as Radix sort, among others.&lt;br /&gt;
&lt;br /&gt;
The complexity of development is represented by the number of lines of code per application, as shown in the table below. SAS complexity is lower than that of MP, and the difference increases as applications become more irregular and dynamic in nature (the MP version is almost twice as long for Radix).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Appl.&lt;br /&gt;
!FFT &lt;br /&gt;
!OCEAN &lt;br /&gt;
!LU &lt;br /&gt;
!RADIX &lt;br /&gt;
!SAMPLE &lt;br /&gt;
!N-BODY&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MPI ||222||4320||470||384||479||1371&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SAS ||210 ||2878 ||309||201||450||950&lt;br /&gt;
|}&lt;br /&gt;
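&lt;br /&gt;
The relative code sizes implied by the table can be checked with a short calculation.  The figures below are copied from the table above; the dictionary names are just labels for this sketch.&lt;br /&gt;
&lt;br /&gt;
```python
# Lines of code per application, copied from the table above.
mpi = {'FFT': 222, 'OCEAN': 4320, 'LU': 470, 'RADIX': 384, 'SAMPLE': 479, 'N-BODY': 1371}
sas = {'FFT': 210, 'OCEAN': 2878, 'LU': 309, 'RADIX': 201, 'SAMPLE': 450, 'N-BODY': 950}

# Ratio of MPI lines to SAS lines; values above 1 mean the SAS version is shorter.
ratio = {app: round(mpi[app] / sas[app], 2) for app in mpi}
largest_gap = max(ratio, key=ratio.get)

for app in mpi:
    print(app, ratio[app])
```
&lt;br /&gt;
The largest ratio belongs to RADIX, at roughly 1.9, while the regular FFT code is nearly the same length in both models.&lt;br /&gt;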
&lt;br /&gt;
In terms of performance, the results indicated that SAS was only about half as efficient as MP at exploiting parallelism for most of the applications in this study. The only application that showed similar performance for both methods (MP and SAS) was LU. The authors concluded that the overhead of the SVM protocol for maintaining page coherence and synchronization was the handicap of the easier SAS programming model.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2004, [http://dl.acm.org/citation.cfm?id=1006252 Iosevich and Schuster] performed a study on the choice of memory consistency model in a DSM. The two types of models under study were the '''sequential consistency''' (SC) model and the relaxed consistency model, in particular the '''home-based lazy release consistency''' (HLRC) protocol. The SC provides a less complex programming model, whereas the HLRC improves on running performance as it allows parallel memory access. Memory consistency models provide a specific set of rules for interfacing with memory.&lt;br /&gt;
&lt;br /&gt;
The authors used a [http://www.cs.uga.edu/~dkl/6730/Fall02/Readings/millipage.pdf multiview] technique to ensure an efficient implementation of SC with fine-grain access to memory. The main advantage of this technique is that by mapping one physical region to several virtual regions, the system avoids fetching the whole content of the physical memory when there is a fault accessing a specific variable located in one of the virtual regions. One step further is the mechanism proposed by [http://onlinelibrary.wiley.com/doi/10.1002/spe.417/abstract Niv and Schuster] to dynamically change the granularity during runtime.&lt;br /&gt;
For SC with multiview (MV), only the page replicas need to be tracked, whereas for HLRC the tracking needed is much more complex. The advantage of HLRC is that write faults on read-only pages are handled locally, so these operations cost less.&lt;br /&gt;
&lt;br /&gt;
This table summarizes the characteristics of the benchmark applications used for the measurements. In the Synch column, B represents barriers and  L locks.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Application&lt;br /&gt;
! Input data set&lt;br /&gt;
! Shared memory&lt;br /&gt;
!Sharing granularity&lt;br /&gt;
! Synch&lt;br /&gt;
!Allocation pattern&lt;br /&gt;
|-&lt;br /&gt;
| Water-nsq|| 8000 molecules|| 5.35MB|| a molecule (672B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Water-sp|| 8000 molecules|| 10.15MB|| a molecule (680B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| LU|| 3072 × 3072|| 72.10MB|| block (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| FFT|| 2^20 numbers|| 48.25MB|| a row segment|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| TSP|| A graph of 32 cities|| 27.86MB|| a tour (276B)|| L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| SOR|| 2066 × 10240|| 80.73MB|| a row (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Barnes-sp|| 32768 bodies|| 41.21MB|| body fields (4-32B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Radix|| 10240000 keys|| 82.73MB|| an integer (4B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Volrend|| a file -head.den-|| 29.34MB|| a 4 × 4 box (4B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Ocean|| a 514 × 514 grid|| 94.75MB|| grid point (8B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| NBody|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|-&lt;br /&gt;
| NBodyW|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The authors found that the average speedup of the HLRC protocol is 5.97, and the average speedup of the SC(with MV) protocol is 4.5 if non-optimized allocation is used, and 6.2 if the granularity is changed dynamically. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2008, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.1319 Roy and Chaudhary] compared the communication requirements of three different page-based DSM systems ([http://www.cs.umd.edu/projects/cvm/ CVM], [http://www.cs.utah.edu/flux/quarks.html Quarks], and [http://ieeexplore.ieee.org/iel4/5737/15339/00709960.pdf?arnumber=709960 Strings]) that use virtual memory mechanisms to trap accesses to shared areas. Their study was also based on the SPLASH-2 suite of programs.&lt;br /&gt;
In their experiments, Quarks was running tasks on separate processes (it does not support more than one application thread per process) while CVM and Strings were run with multiple application threads.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Program &lt;br /&gt;
!CVM &lt;br /&gt;
!Quarks &lt;br /&gt;
!Strings&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| FFT||1290||2419||1894&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-c||135||-||485&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-n||385||2873||407&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| OCEAN-c||1955||15475||6676&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-n2||2253||38438||10032&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-sp||905||7568||1998&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MATMULT||290||1307||645&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SOR||247||7236||934&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The results in the table above indicate that Quarks generates a large quantity of messages in comparison with the other two systems. LU-c (the contiguous version of LU) has no result for Quarks, because the excessive number of collisions made it impossible to obtain results for the number of tasks (16) used in the experiment. In general, because of lock-related traffic, the performance of Quarks is quite low compared with the other two systems for many of the application programs tested.&lt;br /&gt;
CVM improves on the lock management by allowing out-of-order access through a centralized lock manager. This makes the locking times for CVM much smaller than for the others. &lt;br /&gt;
Nevertheless, the best performer is Strings. It outperforms the other two systems overall, but there is room for improvement here too, as locking times in Strings account for a high percentage of the overall computation. &lt;br /&gt;
&lt;br /&gt;
The overall conclusion is that using multiple threads per node and optimized locking mechanisms (CVM-type) provides the best performance.&lt;br /&gt;
&lt;br /&gt;
=== Evolution ===&lt;br /&gt;
A more recent version of a distributed shared memory system is vNUMA.   [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html Chapman and Heiser] describe vNUMA (where v is for virtual and NUMA is for [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non Uniform Memory Access]) as &amp;quot;a virtual machine that presents a cluster as a virtual shared-memory multiprocessor.&amp;quot; The virtualization in vNUMA is a layer between the real CPUs that form the distributed shared memory system and the OS that runs on top of it, usually Linux. The DSM in vNUMA is part of the hypervisor, which is the part of a virtual system that maps guest virtual addresses into real physical ones. The guest virtual addresses are then mapped into virtual memory addresses through the guest OS. &lt;br /&gt;
&lt;br /&gt;
The difference from other virtual machines is that vNUMA runs one OS on top of several physical machines, whereas virtual machines often run several guest OSes on top of one host OS that runs on one machine. It also uses DSM software techniques to present all the virtualized memory as a whole.&lt;br /&gt;
&lt;br /&gt;
The DSM in vNUMA is a single-writer/multiple-reader write-invalidate protocol with sequential consistency. It is based on the IVY DSM, but it introduces several improvements to increase performance. The owner of a page can determine whether the page needs to be sent by looking at the copyset (the set of nodes that hold a copy of the page), which avoids several page faults, and the manager becomes the owner of the copyset as soon as it is part of it. Two other improvements are ''incremental deterministic merging'' and ''write-update-plus (WU+)''. ''Incremental deterministic merging'' uses sequence numbers to ensure that a location is updated with the latest value and not with intermediate, out-of-order writes. ''Write-update-plus (WU+)'' enforces single-writer for pages on which atomic operations are performed: vNUMA dynamically changes from multiple-writer to single-writer when atomic operations are detected. &lt;br /&gt;
&lt;br /&gt;
In [http://communities.vmware.com/community/vmtn/cto/high-performance/blog/2011/09/19/vnuma-what-it-is-and-why-it-matters vNUMA what it is and why it matters], VMware presents vNUMA as part of vSphere, a virtualization platform oriented toward building cloud computing frameworks.&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
*[http://www.sgi.com/pdfs/4250.pdf Performance and Productivity Breakthroughs with Very Large Coherent Shared Memory: The SGI® UV Architecture] SGI white paper.&lt;br /&gt;
*[http://www.scalemp.com/architecture#2 Versatile SMP (vSMP) Architecture]&lt;br /&gt;
*Adrian Moga, Michel Dubois, [http://www.sciencedirect.com/science/article/pii/S1383762108001136 &amp;quot;A comparative evaluation of hybrid distributed shared-memory systems,&amp;quot;] Journal of Systems Architecture, Volume 55, Issue 1, January 2009, Pages 43-52&lt;br /&gt;
*Jinbing Peng; Xiang Long; Limin Xiao; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=4708970&amp;amp;isnumber=4708921 &amp;quot;DVMM: A Distributed VMM for Supporting Single System Image on Clusters,&amp;quot;] Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for , vol., no., pp.183-188, 18-21 Nov. 2008&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
*Shan, H.; Singh, J.P.; Oliker, L.; Biswas, R., [http://escholarship.org/uc/item/76p9b40g#page-1 &amp;quot;Message passing vs. shared address space on a cluster of SMPs,&amp;quot;] Proceedings of the 15th International Parallel and Distributed Processing Symposium, 8 pp., Apr 2001&lt;br /&gt;
*Protic, J.; Tomasevic, M.; Milutinovic, V.; , [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 &amp;quot;Distributed shared memory: concepts and systems,&amp;quot;] Parallel &amp;amp; Distributed Technology: Systems &amp;amp; Applications, IEEE , vol.4, no.2, pp.63-71, Summer 1996&lt;br /&gt;
*Chandola, V. , [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.5206&amp;amp;rep=rep1&amp;amp;type=pdf &amp;quot;Design Issues in Implementation of Distributed Shared Memory in User Space,&amp;quot;]&lt;br /&gt;
*Nitzberg, B.; Lo, V. , [http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf &amp;quot;Distributed Shared Memory:  A Survey of Issues and Algorithms&amp;quot;]&lt;br /&gt;
*Steven Cameron Woo , Moriyoshi Ohara , Evan Torrie , Jaswinder Pal Singh , Anoop Gupta, [http://dl.acm.org/citation.cfm?id=223990 &amp;quot;The SPLASH-2 programs: characterization and methodological considerations,&amp;quot;] Proceedings of the 22nd annual international symposium on Computer architecture, p.24-36, June 22-24, 1995, S. Margherita Ligure, Italy&lt;br /&gt;
*Jegou, Y. , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Implementation of page management in Mome, a user-level DSM] Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on, p.479-486, 21 May 2003, IRISA/INRIA, France&lt;br /&gt;
*Hennessy, J.; Heinrich, M.; Gupta, A., [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=747863 &amp;quot;Cache-Coherent Distributed Shared Memory:  Perspectives on Its Development and Future Challenges,&amp;quot;] Proceedings of the IEEE, Volume: 87 Issue:3, pp.418 - 429, Mar 1999, Comput. Syst. Lab., Stanford Univ., CA&lt;br /&gt;
*J. Protic, M. Tomasevic, and V. Milutinovic, [http://media.wiley.com/product_data/excerpt/76/08186773/0818677376-2.pdf An Overview of Distributed Shared Memory]&lt;br /&gt;
*Yoon, M.; Malek, M.; , [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf &amp;quot;Configurable Shared Virtual Memory for Parallel Computing&amp;quot;] University of Texas Technical Report tr94-21, July 15 1994, Department of Electrical and Computer Engineering, The University of Texas at Austin&lt;br /&gt;
*Dubnicki, C.;   Iftode, L.;   Felten, E.W.;   Kai Li;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=508084 &amp;quot;Software Support for Virtual Memory-Mapped Communication&amp;quot;] Parallel Processing Symposium, 1996., Proceedings of IPPS '96, The 10th International, pp.372 - 381, 15-19 Apr 1996, Dept. of Comput. Sci., Princeton Univ., NJ &lt;br /&gt;
*Dubnicki, C.;   Bilas, A.;   Li, K.;   Philbin, J.;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=580931 &amp;quot;Design and Implementation of Virtual Memory-Mapped Communication on Myrinet&amp;quot;] Parallel Processing Symposium, 1997. Proceedings., 11th International, pp.388 - 396, 1-5 Apr 1997, Princeton Univ., NJ&lt;br /&gt;
*Kranz, D.; Johnson, K.; Agarwal, A.; Kubiatowicz, J.; Lim, B.; , [http://delivery.acm.org.prox.lib.ncsu.edu/10.1145/160000/155338/p54-kranz.pdf?ip=152.1.24.251&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=81831968&amp;amp;CFTOKEN=62928147&amp;amp;__acm__=1327853249_029db7e958cb50bd47056f939f3296f7 &amp;quot;Integrating Message-Passing and Shared-Memory:  Early Experience&amp;quot;] PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp.54 - 63&lt;br /&gt;
*Amza, C.;  Cox, A.L.;  Dwarkadas, S.;  Keleher, P.;  Honghui Lu;  Rajamony, R.;  Weimin Yu;  Zwaenepoel, W.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 &amp;quot;TreadMarks: shared memory computing on networks of workstations&amp;quot;] Computer, Volume: 29 Issue:2, pp. 18 - 28, Feb 1996, Dept. of Comput. Sci., Rice Univ., Houston, TX&lt;br /&gt;
*David R. Cheriton. 1985. [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems.] SIGOPS Oper. Syst. Rev. 19, 4 (October 1985), 26-33.&lt;br /&gt;
*Matthew Chapman and Gernot Heiser. 2009. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html vNUMA: a virtual shared-memory multiprocessor.] In Proceedings of the 2009 conference on USENIX Annual technical conference (USENIX'09). USENIX Association, Berkeley, CA, USA, 2-2.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
The memory hierarchy described for the CSVS system places remote memories:&lt;br /&gt;
# Between main memory and local disk storage&lt;br /&gt;
# At the same level as local disk storage&lt;br /&gt;
# Below local disk storage&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When messages are sent by the OS to retrieve non-local data, the virtual address of the retrieved data is translated to a physical address:&lt;br /&gt;
# At the origin of the message, i.e. where the page fault occurs&lt;br /&gt;
# By the DSM system default manager&lt;br /&gt;
# At the location where the desired page resides&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
DSM nodes&lt;br /&gt;
# partially map variable amounts of their memory to the distributed address space&lt;br /&gt;
# are configured to supply a contiguous and fixed amount of memory to the distributed address space&lt;br /&gt;
# utilize I/O to access the entirely non-local distributed address space&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SAS programming model:&lt;br /&gt;
# Has evolved beyond MP as it is difficult to program in scalable DSM environments&lt;br /&gt;
# Utilizes MP to communicate but relies on the ease of a common address space&lt;br /&gt;
# Has suffered too many security problems; scalable MP now dominates the landscape&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Page management in MOME:&lt;br /&gt;
# Requires consistent address space mapping across all nodes&lt;br /&gt;
# Is managed from a global DSM perspective&lt;br /&gt;
# Allows an F and V page descriptor to occur for the same page on the same node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The most adopted DSM algorithm is:&lt;br /&gt;
# Single Reader/ Single Writer (SRSW)&lt;br /&gt;
# Multiple Readers/ Single Writer (MRSW)&lt;br /&gt;
# Multiple Readers/Multiple Writers (MRMW)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Sequential Consistency:&lt;br /&gt;
# all processors see the same ordering of memory references, and these are issued in sequences by the individual processors&lt;br /&gt;
# the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence&lt;br /&gt;
# consistency is required only on synchronization accesses&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
SPLASH is a&lt;br /&gt;
# coherence protocol&lt;br /&gt;
# collection of parallel programs engineered for the evaluation of shared address space machines&lt;br /&gt;
# DSM implementation&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The complexity of development is &lt;br /&gt;
# the same for MP and SAS&lt;br /&gt;
# lower for SAS&lt;br /&gt;
# lower for MP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
vNUMA is a&lt;br /&gt;
# Fast Fourier Transform implementation&lt;br /&gt;
# network implementation&lt;br /&gt;
# virtual machine that presents a cluster as a virtual shared-memory multiprocessor&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72419</id>
		<title>CSC/ECE 506 Spring 2013/4a ss</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/4a_ss&amp;diff=72419"/>
		<updated>2013-02-10T21:26:58Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: Created page with &amp;quot;SCD's IBM SP system blackforest, a distributed shared memory ('''DSM''') system  == SAS programming on distributed-memory machines == [http://en...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Dsm.jpg|300px|thumb|left|SCD's IBM SP system blackforest, a distributed shared memory ('''DSM''') system]]&lt;br /&gt;
&lt;br /&gt;
== SAS programming on distributed-memory machines ==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Shared_memory '''Shared Address Space'''] (SAS) programming on distributed memory machines is a programming abstraction that requires less development effort than the traditional method of [http://en.wikipedia.org/wiki/Message_passing '''Message Passing'''] (MP) on distributed memory machines, such as clusters of servers.  Distributed systems are groups of computers that communicate through a network and share a common work goal.  Distributed systems typically do not physically share the same memory (are not [http://en.wikipedia.org/wiki/Coupling_%28computer_programming%29 '''tightly coupled''']); instead, each processor or group of processors must depend on mechanisms other than direct memory access in order to communicate.  Relevant issues that come to bear include [http://en.wikipedia.org/wiki/Memory_coherence '''memory coherence'''], types of memory access, data and process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''], and performance.&lt;br /&gt;
&lt;br /&gt;
=== Origins ===&lt;br /&gt;
Distributed memory systems started to flourish in the 1980s. Increasing processor performance and network connectivity offered the perfect environment for parallel processing over a network of computers, a cheap way to put together massive computing power. The main drawback was the move from sequential programs written for local memory to parallel programs working on shared memory. This is where SAS provided the means to simplify programming by hiding the mechanisms used to access distant memory located in other computers of the cluster.&lt;br /&gt;
&lt;br /&gt;
In 1985, Cheriton, in his article [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 &amp;quot;Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems&amp;quot;], introduced ideas for applying shared memory techniques in distributed memory systems. Cheriton envisioned a system of nodes with a pool of shared memory with a common file namespace that could &amp;quot;decentralize the implementation of a service.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:Dsmd.jpg|400px|thumb|right|Distributed Shared Memory]]&lt;br /&gt;
Early distributed computer systems relied almost exclusively on message passing in order to communicate with one another, and this technique is still widely used today.  In a message passing model, each processor's local memory can be considered as isolated from that of the rest of the system.  Processes or objects can send or receive messages in order to communicate, and this can occur in a synchronous or asynchronous manner.  In distributed systems, and particularly with certain types of programs, the message passing model can become overly burdensome to the programmer, as tracking data movement and maintaining data integrity can become quite challenging with many control threads.  A shared address or shared-memory system, however, can provide a programming model that simplifies data sharing via uniform mechanisms of data structure reads and writes on common memory.  Current distributed systems seek to take advantage of both SAS and MP programming model principles in hybrid systems. &lt;br /&gt;
&lt;br /&gt;
=== Distributed Shared Memory (DSM) ===&lt;br /&gt;
Generally, a distributed system consists of a set of nodes connected by an interconnection network.  Nodes may consist of individual processors or of a multiprocessor system (e.g. [http://en.wikipedia.org/wiki/Symmetric_multiprocessing '''Symmetric Multiprocessor'''] (SMP)), the latter typically sharing a system bus.  Each node contains a local memory, part of which maps to the distributed address space.  &lt;br /&gt;
Regardless of the system topology (bus, ring, mesh), a specific interconnection in each cluster must connect it to the system. Information about the states and current locations of particular data blocks usually resides in a system table or directory. Directory organization varies from full-map storage to various dynamic organizations.&lt;br /&gt;
Three issues arise when accessing data in the DSM address space while keeping that data consistent:&lt;br /&gt;
a) Which DSM algorithm to use to access data.&lt;br /&gt;
Commonly used strategies are replication and migration. Replication allows multiple copies of the same data item to reside in different local memories. Migration implies that only a single copy of a data item exists at any one time, so the data item must be moved to the requesting site for exclusive use.  &lt;br /&gt;
b) The implementation level of the DSM.&lt;br /&gt;
c) The memory consistency model.&lt;br /&gt;
=== Cache-Coherent DSM ===&lt;br /&gt;
&lt;br /&gt;
Early DSM systems implemented a shared address space where the amount of time required to access a piece of data was &lt;br /&gt;
related to its location.  These systems became known as [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access '''Non-Uniform Memory Access'''] (NUMA), whereas an SMP type&lt;br /&gt;
system is known as [http://en.wikipedia.org/wiki/Uniform_Memory_Access '''Uniform Memory Access'''] (UMA) architecture.  NUMA architectures were difficult to program in due &lt;br /&gt;
to potentially significant differences in data access times. SMP architectures dealt with this problem through caching. &lt;br /&gt;
Protocols were established that ensured prior to writing a location, all other copies of the location (in other caches)&lt;br /&gt;
were invalidated.  These protocols did not scale to DSM machines and different approaches were necessary.&lt;br /&gt;
&lt;br /&gt;
Cache-coherent DSM architectures rely on directory-based [http://en.wikipedia.org/wiki/Cache_coherency '''cache coherence'''], in which an extra directory structure keeps track&lt;br /&gt;
of all blocks that have been cached by each processor.  A coherence protocol can then establish a consistent view of &lt;br /&gt;
memory by maintaining state and other information about each cached block.  At a minimum, these states usually include Invalid,&lt;br /&gt;
Shared, and Exclusive.  Furthermore, in a cache-coherent DSM machine, the directory is distributed across the nodes, so that each entry resides&lt;br /&gt;
in the local physical memory that holds the block it describes.&lt;br /&gt;
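&lt;br /&gt;
The bookkeeping a directory performs can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the protocol of any particular machine: the class Directory, its read and write methods and the block name B0 are hypothetical, and only the three minimal states named above are modelled.&lt;br /&gt;

```python
class Directory:
    # Hypothetical directory: one entry per memory block.
    def __init__(self):
        self.state = {}     # block -> 'Invalid', 'Shared' or 'Exclusive'
        self.holders = {}   # block -> set of node ids caching the block

    def read(self, block, node):
        # A read adds the node to the sharer set; any exclusive copy is
        # downgraded, so the block ends up Shared.
        self.holders.setdefault(block, set()).add(node)
        self.state[block] = 'Shared'

    def write(self, block, node):
        # Before the write proceeds, every other cached copy is invalidated.
        invalidated = sorted(self.holders.get(block, set()) - {node})
        self.holders[block] = {node}
        self.state[block] = 'Exclusive'
        return invalidated


directory = Directory()
directory.read('B0', node=0)
directory.read('B0', node=1)
print(directory.write('B0', node=2))   # nodes whose copies were invalidated
```

Because the directory knows exactly which nodes hold a copy, invalidations are sent only to those nodes instead of being broadcast, which is what lets the scheme scale beyond a shared bus.&lt;br /&gt;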
&lt;br /&gt;
=== Page Management and memory mapping in Mome ===&lt;br /&gt;
[[File:Untitled_Project.jpg|350px|thumb|left|Memory Mapping in Mome]]&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Mome] is described by its developers as a user-level distributed shared memory.  Mome, in 2003, was a run-time model that mapped Mome segments onto node private address space.  &lt;br /&gt;
&lt;br /&gt;
===== Mome Segment creation =====&lt;br /&gt;
&lt;br /&gt;
Segment creation was initiated through a ''MomeCreateSegment(size)'' call, which returned an identifier used for mapping by all nodes.  Any process can request a mapping of a section of its local memory to a section of a Mome segment by calling ''MomeMap(Addr, Lg, Prot, Flags, Seg, Offset)'', which returns the starting address of the mapped region.  Each mapping request made by a process is independent, and the addresses of the mappings may or may not be consistent across nodes.  If mappings are consistent between processes, however, then pointers may be shared by them.  Mome supports strong and weak consistency models, and for any particular page each node is able to dynamically manage its consistency during program execution.&lt;br /&gt;
&lt;br /&gt;
===== Page Management in Mome =====&lt;br /&gt;
&lt;br /&gt;
Mome manages [http://en.wikipedia.org/wiki/Page_%28computer_memory%29 '''pages'''] in a directory-based scheme where each page directory maintains the status of six characteristics per page on each node.  The page manager acts upon collections of nodes according to these characteristics for each page:  &lt;br /&gt;
V nodes possess the current version, M nodes have a modified version, S nodes want strong consistency, I nodes are &lt;br /&gt;
invalidated, F nodes have initiated a modification merge and H nodes are a special type of hidden page.  A new version of &lt;br /&gt;
a page is created prior to a constraint violation and before modifications are integrated as a result of a consistency&lt;br /&gt;
request.  &lt;br /&gt;
&lt;br /&gt;
===== Memory mapping in Mome =====&lt;br /&gt;
&lt;br /&gt;
The Mome memory mapping figure to the left shows a possible DSM memory organization on a single node.  The DSM memory size&lt;br /&gt;
shown is 22 pages.  When a new segment is created on a node a segment descriptor is created on that node.  In this case the&lt;br /&gt;
segment descriptor is 12 pages, with each segment descriptor block corresponding to one page.  Each block also contains&lt;br /&gt;
three DSM memory references for current, modified and next version of pages.  The memory organization state shows an &lt;br /&gt;
application with two mappings, M1 and M2, with segment offsets at 0 and 8.  The six pages of M1 are managed by segment &lt;br /&gt;
descriptor blocks 0 to 5.  The descriptor blocks (and application memory) show that pages 1, 2 and 5 have no associated &lt;br /&gt;
memory, while M1 page 0 is mapped to block 6 as a current version and M1 page 3 is mapped to block 13 as a current version,&lt;br /&gt;
block 8 as a modified version, and has initiated a modifications merge as indicated by the block 17 pointer.  The communication&lt;br /&gt;
layer manages incoming messages from other nodes. &lt;br /&gt;
&lt;br /&gt;
[[File:Mem hierarchy.png|200px|thumb|right|Memory hierarchy of node]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Node Communication ===&lt;br /&gt;
As described by [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf Yoon, et al.] in 1994, the communication paradigm &lt;br /&gt;
for their DSM node network relies on a memory hierarchy for each node that places remote memories at the same level as &lt;br /&gt;
its own local disk storage.  [http://en.wikipedia.org/wiki/Page_fault '''Page faults'''] within a given node that can be resolved within disk storage are handled &lt;br /&gt;
normally while those that cannot are resolved between node main memory and memory of other nodes. Point to point &lt;br /&gt;
communication at the node level is supported through message passing, and the specific mechanism for communication is&lt;br /&gt;
agreed to by all nodes. &lt;br /&gt;
&lt;br /&gt;
Yoon describes a DSM system that generates a shared virtual memory on a per-job basis.  A '''configurable shared virtual address space'''&lt;br /&gt;
(CSVS) is prepared when a member node receives a job: the node generates a job identification number and creates&lt;br /&gt;
an information table in its memory:&lt;br /&gt;
&lt;br /&gt;
                                     ''JOB_INFORMATION {''&lt;br /&gt;
                                         ''status;''&lt;br /&gt;
                                         ''number_of_tasks;''&lt;br /&gt;
                                         ''number_of_completed_tasks;''&lt;br /&gt;
                                         ''*member_list;''                    /*pointer to first member*/&lt;br /&gt;
                                         ''number_of_members;''&lt;br /&gt;
                                         ''IO_server;''&lt;br /&gt;
                                     ''}''&lt;br /&gt;
&lt;br /&gt;
The ''status'' refers to the creation of the CSVS and ''number_of_members'' and ''member_list'' are established through&lt;br /&gt;
a task distribution process during address space assignment.  All tasks associated with the program are tagged with&lt;br /&gt;
the ''job_id'' and ''requester_id'' and, following address space assignment, are distributed across the system.  The&lt;br /&gt;
actual CSVS creation occurs when the first task of a job is initiated by a member, which requests generation of the new&lt;br /&gt;
CSVS from all other members.  Subspace assignment for the SAS model ensues under the specific ''job_id''.&lt;br /&gt;
&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Operating_system '''operating system'''] (OS) or [http://en.wikipedia.org/wiki/Memory_management_unit '''memory management unit'''] (MMU) of each member maintains a copy of the ''JOB_INFORMATION'' &lt;br /&gt;
table which is consulted to identify the default manager when a page fault occurs.  When a page fault does occur, the MMU&lt;br /&gt;
locates the default manager and handles the fault normally.  If the page requested is out of its subspace then the &lt;br /&gt;
virtual address, ''job_id'', and default manager identification are sent to the [http://en.wikipedia.org/wiki/Control_unit '''control unit'''] (CU) to construct a &lt;br /&gt;
message requesting a page copy.  All messages sent through the CSVS must include a virtual address and the ''job_id'',&lt;br /&gt;
which acts as protection to control access to relevant memory locations.  When received at the appropriate member&lt;br /&gt;
node, the virtual address is translated to a local physical address.&lt;br /&gt;
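&lt;br /&gt;
The last two steps, the ''job_id'' check and the translation at the receiving member, can be sketched as follows. This is a hypothetical Python model and not code from the CSVS report: the names PageRequest, MemberNode and handle, the 4096-byte page size and the dictionary used as a page table are all assumptions made for illustration.&lt;br /&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRequest:
    # Every message carries the job_id and a virtual address.
    job_id: int
    virtual_address: int
    requester_id: int


class MemberNode:
    PAGE_SIZE = 4096   # assumed page size

    def __init__(self, job_id, page_table):
        self.job_id = job_id
        self.page_table = page_table   # virtual page number -> physical frame

    def handle(self, request):
        # The job_id acts as protection: requests for other jobs are refused.
        if request.job_id != self.job_id:
            raise PermissionError('job_id does not match this CSVS')
        # Translation happens here, at the node where the page resides.
        page, offset = divmod(request.virtual_address, self.PAGE_SIZE)
        return self.page_table[page] * self.PAGE_SIZE + offset


owner = MemberNode(job_id=7, page_table={2: 5})
request = PageRequest(job_id=7, virtual_address=2 * 4096 + 12, requester_id=1)
print(owner.handle(request))
```

The requesting node never needs to know the physical layout of the remote memory; it only supplies the virtual address and the ''job_id''.&lt;br /&gt;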
[[File:Jacobi_code.jpg|300px|thumb|left|Jacobi method pseudocode using TreadMarks API]]&lt;br /&gt;
&lt;br /&gt;
=====Improvements in communication=====&lt;br /&gt;
Early SAS programming models in DSM environments suffered from poor performance because protection schemes required&lt;br /&gt;
applications to access the network via system calls, significantly increasing latency.  Later software&lt;br /&gt;
systems and network interfaces arose that were able to ensure safety without incurring the time cost of the system calls.  Addressing this and other &lt;br /&gt;
latency sources on both ends of communication was an important goal for projects such as the&lt;br /&gt;
'''Virtual memory-mapped communication''' (VMMC) model that was developed as part of the [http://shrimp.cs.princeton.edu/index.html Shrimp Project]. &lt;br /&gt;
&lt;br /&gt;
Protection is achieved in VMMC because the receiver must grant permission before the sender is allowed to transfer data&lt;br /&gt;
to a receiver defined area of its address space.  In this communication scheme, the receiver process exports areas of its&lt;br /&gt;
address space that will act as receive buffers and sending processes must import the destinations.  There is no explicit &lt;br /&gt;
receive operation in VMMC.  Receivers are able&lt;br /&gt;
to define which senders can import specific buffers and VMMC ensures only receiver buffer space is overwritten.  Imported&lt;br /&gt;
receive buffers are mapped to a destination proxy space which can be implemented as part of the sender's virtual address&lt;br /&gt;
space and can be translated by VMMC to a receiver, process and memory address.  VMMC supports a deliberate update&lt;br /&gt;
request and will update data sent previously to an imported receive buffer.  This transfer occurs directly without receiver&lt;br /&gt;
CPU interruption.&lt;br /&gt;
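&lt;br /&gt;
The export/import handshake described above can be modelled with a short Python sketch. It imitates the permission flow only and is not the VMMC interface: the classes Receiver and Sender, the method names export, import_buffer and deliberate_update, and the buffer name rx are hypothetical.&lt;br /&gt;

```python
class Receiver:
    def __init__(self):
        self.buffers = {}   # buffer name -> bytearray acting as receive buffer
        self.allowed = {}   # buffer name -> sender ids permitted to import it

    def export(self, name, size, senders):
        # The receiver decides which senders may write into this buffer.
        self.buffers[name] = bytearray(size)
        self.allowed[name] = set(senders)


class Sender:
    def __init__(self, sender_id):
        self.sender_id = sender_id
        self.imported = {}  # buffer name -> receiver that exported it

    def import_buffer(self, receiver, name):
        if self.sender_id not in receiver.allowed.get(name, set()):
            raise PermissionError('receiver has not granted access')
        self.imported[name] = receiver

    def deliberate_update(self, name, offset, data):
        # Data lands directly in the exported buffer; there is no receive call.
        buffer = self.imported[name].buffers[name]
        end = offset + len(data)
        valid = range(len(buffer) + 1)
        if offset not in valid or end not in valid:
            raise ValueError('transfer would fall outside the receive buffer')
        buffer[offset:end] = data


receiver = Receiver()
receiver.export('rx', size=8, senders={'A'})
sender = Sender('A')
sender.import_buffer(receiver, 'rx')
sender.deliberate_update('rx', 2, b'hi')
print(bytes(receiver.buffers['rx']))
```

The bounds check stands in for the guarantee that only receive buffer space is overwritten, and the failed import for a sender that was never granted permission.&lt;br /&gt;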
&lt;br /&gt;
[[File:Shortest_path_pseudocode.jpg|300px|thumb|right|Shortest path pseudocode using TreadMarks API]]&lt;br /&gt;
=== Programming Environment ===&lt;br /&gt;
The globally shared memory abstraction provided through virtual memory or some other DSM mechanism allows programmers &lt;br /&gt;
to focus on algorithms instead of processor communication and data tracking.  Many programming environments have been&lt;br /&gt;
developed for DSM systems including Rice University's [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]&lt;br /&gt;
in the 1990s.  TreadMarks was a user-level library that ran on top of Unix.  Programs were written in&lt;br /&gt;
C, C++ or Fortran and then compiled and linked with the TreadMarks library.  &lt;br /&gt;
&lt;br /&gt;
Shown at left is a pseudocode example of using the TreadMarks API to implement the Jacobi method, a type of partial &lt;br /&gt;
differential equation solver.  The code iterates over a 2D array and updates each element to the average of its four&lt;br /&gt;
nearest neighbors.  All processors are assigned an approximately equivalent number of rows and neighboring processes &lt;br /&gt;
share boundary rows as is necessary for the calculation.  This example shows TreadMarks' use of [http://en.wikipedia.org/wiki/Barrier_%28computer_science%29 '''barriers'''], a technique used for process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''].  Barriers prevent race&lt;br /&gt;
conditions.  ''void Tmk_startup(int argc, char **argv)'' initializes TreadMarks and starts the remote processes.  &lt;br /&gt;
The ''void Tmk_barrier(unsigned id)'' call blocks the calling process until every other process arrives at the barrier.  In this&lt;br /&gt;
example, ''Tmk_barrier(0)'' guarantees that process 0 completes initialization before any process proceeds, ''Tmk_barrier(1)'' &lt;br /&gt;
guarantees all previous iteration values are read before any current iteration values are written, and ''Tmk_barrier(2)''&lt;br /&gt;
guarantees all current iteration values are written before any next iteration computation begins.&lt;br /&gt;
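&lt;br /&gt;
The same two-barrier structure can be reproduced on a single machine with Python threads. This is a sketch under stated assumptions rather than TreadMarks code: ''threading.Barrier'' stands in for ''Tmk_barrier'', threads stand in for remote processes, rows are dealt out round-robin instead of in contiguous blocks, and the function name jacobi and the scratch array are hypothetical.&lt;br /&gt;

```python
import threading


def jacobi(grid, iterations, workers):
    rows, cols = len(grid), len(grid[0])
    scratch = [row[:] for row in grid]          # holds the new values
    barrier = threading.Barrier(workers)
    interior = list(range(1, rows - 1))

    def work(rank):
        mine = interior[rank::workers]          # rows owned by this worker
        for _ in range(iterations):
            for i in mine:
                for j in range(1, cols - 1):
                    scratch[i][j] = (grid[i - 1][j] + grid[i + 1][j]
                                     + grid[i][j - 1] + grid[i][j + 1]) / 4
            barrier.wait()   # all old values have been read
            for i in mine:
                grid[i][1:cols - 1] = scratch[i][1:cols - 1]
            barrier.wait()   # all new values have been written

    threads = [threading.Thread(target=work, args=(r,)) for r in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return grid


grid = [[4.0, 4.0, 4.0], [4.0, 0.0, 4.0], [4.0, 4.0, 4.0]]
print(jacobi(grid, iterations=1, workers=2)[1][1])
```

The first wait plays the role of ''Tmk_barrier(1)'' and the second that of ''Tmk_barrier(2)''; removing either one would let a worker overwrite values that a neighbor still needs to read.&lt;br /&gt;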
&lt;br /&gt;
To the right is shown a short pseudocode program exemplifying another SAS synchronization technique which uses [http://en.wikipedia.org/wiki/Lock_%28computer_science%29 '''locks'''].  This program calculates the shortest path in a grouping of nodes that starts at any designated start node, visits each&lt;br /&gt;
other node once and returns to the origin node.  The shortest route identified thus far is stored in the shared ''Shortest_length''&lt;br /&gt;
and investigated routes are kept in a queue, most promising at the front, and expanded one node at a time.  A process&lt;br /&gt;
compares its resulting shortest partial path with ''Shortest_length'', updating if necessary and returns to the queue&lt;br /&gt;
to continue its search.  Process 0 allocates the shared queue and minimum length.  Exclusive access must be established&lt;br /&gt;
and maintained to ensure correctness and this is achieved through a lock on the queue and ''Shortest_length''.  Each&lt;br /&gt;
process acquires the queue lock to identify a promising partial path and releases it upon finding one.  When &lt;br /&gt;
updating ''Shortest_length'', a lock is acquired to ensure [http://en.wikipedia.org/wiki/Mutual_exclusion '''mutual exclusion'''] while this shared data is modified as well.&lt;br /&gt;
&lt;br /&gt;
=== DSM Implementations ===&lt;br /&gt;
From an architectural point of view, DSMs are composed of several nodes connected via a network. Each node can be an individual machine or a cluster of machines. Each node has local memory modules that belong, in part or entirely, to the shared memory. Many characteristics can be used to classify DSM implementations. The most obvious one is the level at which the DSM is implemented: software, hardware, or hybrid. This historical classification has been extracted from [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 Distributed shared memory: concepts and systems].&lt;br /&gt;
&lt;br /&gt;
Software DSM implementations are those in which the DSM is implemented in user-level software, the OS, a programming language, or a combination of some or all of them. &lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Implementation'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Implementation / Cluster configuration'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Network'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Algorithm'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Consistency Model'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Granularity Unit'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Coherence Policy'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''SW/HW/Hybrid'''&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/vm-and-gc/ivy-shared-virtual-memory-li-icpp-1988.pdf IVY]||User-level library + OS modification || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||1 Kbyte ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/121133.121159 Munin]||Runtime system + linker + library + preprocessor + OS modifications ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||Type-specific (SRSW, MRSW, MRMW) ||Release ||Variable size objects ||Type-specific (delayed update, invalidate) ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]||User-level || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRMW ||Lazy release ||4 Kbytes ||Update, Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/74851.74871 Mirage]||OS kernel ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||MRSW ||Sequential ||512 bytes ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://onlinelibrary.wiley.com.prox.lib.ncsu.edu/doi/10.1002/spe.4380210503/pdf Clouds]||OS, out of kernel || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Inconsistent, sequential ||8 Kbytes ||Discard segment when unlocked ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1663305&amp;amp;tag=1 Linda]||Language || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||Variable (tuple size) ||Implementation- dependent ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=93183 Memnet]||Single processor, Memnet device||Token ring||MRSW||Sequential||32 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Scalable_Coherent_Interface SCI]||Arbitrary||Arbitrary||MRSW||Sequential||16 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.8026&amp;amp;rep=rep1&amp;amp;type=pdf KSR1]||64-bit custom PE, I+D caches, 32M local memory||Ring-based||MRSW||Sequential||128 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=766965&amp;amp;isnumber=16621 RMS]||1-4 processors, caches, 256M local memory||RM bus||MRMW||Processor||4 bytes||Update||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Alewife_(multiprocessor) Alewife]||Sparcle PE, 64K cache, 4M local memory, CMMU|| mesh||MRSW||Sequential||16 Kbytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://dl.acm.org/citation.cfm?id=192056 Flash]||MIPS T5, I+D caches, Magic controller|| mesh||MRSW||Release||128 Kbytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s08/papers/10_tempest.pdf Typhoon]||SuperSparc, 2-L caches|| NP controller||MRSW||Custom||32 Kbytes||Invalidate custom||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://shrimp.cs.princeton.edu/ Shrimp]||16 Pentium PC nodes|| Intel Paragon routing network||MRMW||AURC, scope||4 Kbytes||Update/Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Below is an explanation of the main characteristics listed in the DSM classification.&lt;br /&gt;
 &lt;br /&gt;
There are three types of DSM algorithm: &lt;br /&gt;
* '''Single Reader/ Single Writer''' (SRSW) &lt;br /&gt;
** central server algorithm - produces long network delays &lt;br /&gt;
** migration algorithm - produces thrashing and false sharing&lt;br /&gt;
* '''Multiple Readers/ Single Writer''' (MRSW) - read replication algorithm. It uses write invalidate. MRSW is the most adopted algorithm.&lt;br /&gt;
* '''Multiple Readers/Multiple Writers''' (MRMW) - full replication algorithm.  It has full concurrency and uses atomic updates.&lt;br /&gt;
&lt;br /&gt;
The consistency model plays a fundamental role in DSM systems. Because of the distributed nature of these systems, each consistency model constrains memory accesses differently. &amp;quot;A memory consistency model defines the legal ordering of memory references issued by some processor, as observed by other processors.&amp;quot; The stricter the consistency model, the higher the access times, but the simpler the programming. Some of the consistency models are:&lt;br /&gt;
* Sequential consistency - all processors see the same ordering of memory references, and the references of each individual processor appear in the order in which that processor issued them.&lt;br /&gt;
* Processor consistency - the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence. &lt;br /&gt;
* Weak consistency - consistency is required only on synchronization accesses.&lt;br /&gt;
* Release consistency - divides the synchronization accesses into acquire and release. Normal reads and writes for a particular node can be done only after all acquires for that node are finished. Similarly, releases can only be done after all writes and reads are finished. &lt;br /&gt;
* Lazy Release consistency - extends the Release consistency, by propagating modifications to the shared data only on the acquire, and of those, only the ones related to critical sections.&lt;br /&gt;
* Entry consistency - synchronization is performed at variable level. This increases programming labor but helps with lowering latency and traffic exchanges as only the specific variable needs to be synchronized.&lt;br /&gt;
&lt;br /&gt;
Granularity refers to the size of the data blocks that are managed by the coherence protocols. The unit differs between hardware and software systems, as hardware systems tend to use smaller blocks than the virtual memory layer that manages the data in software systems. The problem with larger blocks is that the probability of contention is higher, even when the processors involved are not accessing the exact same piece of memory, only different parts of the same block. This is known as false sharing, and it creates thrashing (memory blocks keep being requested by processors, and processors keep waiting for the same memory blocks).&lt;br /&gt;
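&lt;br /&gt;
The effect of block size on false sharing can be illustrated with a minimal Python sketch. The two addresses and the helper names are hypothetical and chosen only for illustration; the block sizes are the granularity units listed for Memnet, IVY, and TreadMarks in the classification table.&lt;br /&gt;
&lt;br /&gt;
```python
def block_index(address, block_size):
    """Return the index of the coherence block that contains a byte address."""
    return address // block_size


def falsely_shared(addr_a, addr_b, block_size):
    """True when two distinct addresses land in the same coherence block."""
    same_block = block_index(addr_a, block_size) == block_index(addr_b, block_size)
    return addr_a != addr_b and same_block


# Two unrelated variables, 2000 bytes apart (hypothetical addresses).
x_addr, y_addr = 8192, 10192

results = {}
for name, size in [("Memnet", 32), ("IVY", 1024), ("TreadMarks", 4096)]:
    results[name] = falsely_shared(x_addr, y_addr, size)
    print(name, size, results[name])
```
&lt;br /&gt;
Under these assumed addresses, only the 4 Kbyte block places both variables in the same coherence unit, so writes to one variable would interfere with accesses to the other even though the data is logically unrelated.&lt;br /&gt;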
&lt;br /&gt;
Coherence policy regulates data replication. The coherence policy dictates whether the data that is being written at a site should be invalidated or updated at the remote sites. Usually, systems with fine-grain coherence (byte/word) impose the update policy, whereas systems based on coarse-grain (page) coherence use the invalidate policy. In other parts of the literature this is known as the coherence protocol, and the two types of protocol are known as write-invalidate and write-update. The write-invalidate protocol invalidates all copies except one before writing to it. In contrast, write-update keeps all copies updated.&lt;br /&gt;
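&lt;br /&gt;
The difference between the two policies can be sketched with a toy Python model of one block replicated on three nodes. The class and method names are hypothetical, and real protocols also need ownership tracking and messaging, which are omitted here.&lt;br /&gt;
&lt;br /&gt;
```python
class Replicas:
    """Toy model: one shared block replicated on several nodes (hypothetical names)."""

    def __init__(self, nodes, value=0):
        # Every node starts with a valid copy of the block.
        self.copies = {node: value for node in nodes}

    def write_invalidate(self, writer, value):
        # Invalidate every copy except the writer's, then write locally.
        for node in self.copies:
            if node != writer:
                self.copies[node] = None  # None marks an invalid copy
        self.copies[writer] = value

    def write_update(self, writer, value):
        # Propagate the new value so that every copy stays valid.
        for node in self.copies:
            self.copies[node] = value


inv = Replicas(["A", "B", "C"])
inv.write_invalidate("A", 7)
print(inv.copies)  # {'A': 7, 'B': None, 'C': None}

upd = Replicas(["A", "B", "C"])
upd.write_update("A", 7)
print(upd.copies)  # {'A': 7, 'B': 7, 'C': 7}
```
&lt;br /&gt;
In the invalidate case, nodes B and C must fetch the block again before their next read, whereas in the update case every write costs a message to each replica.&lt;br /&gt;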
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
There are numerous studies of the performance of shared memory applications in distributed systems. The vast majority of them use a collection of programs named [http://dl.acm.org/citation.cfm?id=223990 SPLASH and SPLASH-2.]&lt;br /&gt;
===== SPLASH and SPLASH-2 =====&lt;br /&gt;
The '''Stanford ParalleL Applications for SHared memory''' (SPLASH) is a collection of parallel programs engineered for the evaluation of shared address space machines. These programs have been used in research studies to measure and analyze different aspects of the DSM architectures emerging at the time. A subsequent suite of programs (SPLASH-2) grew out of the need to address the limitations of the SPLASH programs. SPLASH-2 covers a broader domain of scientific programs, makes use of improved algorithms, and pays more attention to the architecture of the underlying systems.&lt;br /&gt;
&lt;br /&gt;
Selected applications in the SPLASH-2 collection include:&lt;br /&gt;
*FFT: a '''Fast Fourier Transform''' implementation, in which the data is organized in source and destination matrices so that each processor stores a contiguous set of rows in its local memory. In this application all the processors involved communicate with one another, sending data to each other to perform a matrix transposition.&lt;br /&gt;
*Ocean: calculations of large-scale ocean movements simulating eddy currents. Its computation shows nearest-neighbor access patterns on a multi-grid formation, as opposed to a single grid.&lt;br /&gt;
*LU: decomposition of a matrix into the product of a lower triangular and an upper triangular matrix. LU exhibits a &amp;quot;one-to-many non-personalized communication&amp;quot;.&lt;br /&gt;
*Barnes: simulates the interaction of a group of particles over time steps. &lt;br /&gt;
*Radix sort: integer sorting algorithm. This algorithm implementation displays an example of communication among all the processors involved, and the nature of this communication presents irregular patterns.&lt;br /&gt;
&lt;br /&gt;
===== Case studies =====&lt;br /&gt;
In 2001, [http://escholarship.org/uc/item/76p9b40g#page-1 Shan et al.] presented a comparison of the performance and programming effort of MP versus SAS running on clusters of '''Symmetric Multiprocessors''' (SMPs). They highlighted the &amp;quot;automatic management and coherent replication&amp;quot; of the SAS programming model, which facilitates programming on these types of clusters. The study uses MPI/Pro for the MP programming model and the GeNIMA SVM protocol (a page-based shared virtual memory protocol) for SAS on a 32-processor system (a cluster of 8 machines, each a 4-way SMP). The subset of applications used includes regularly structured applications such as FFT, Ocean, and LU, in contrast with irregular ones such as RADIX sort, among others.&lt;br /&gt;
&lt;br /&gt;
The complexity of development is represented by the number of lines of code per application, as shown in the table below. SAS complexity is significantly lower than that of MP, and the difference increases as applications become more irregular and dynamic in nature (the line count almost doubles for Radix).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Appl.&lt;br /&gt;
!FFT &lt;br /&gt;
!OCEAN &lt;br /&gt;
!LU &lt;br /&gt;
!RADIX &lt;br /&gt;
!SAMPLE &lt;br /&gt;
!N-BODY&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MPI ||222||4320||470||384||479||1371&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SAS ||210 ||2878 ||309||201||450||950&lt;br /&gt;
|}&lt;br /&gt;
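&lt;br /&gt;
As a quick check of the comparison, the following Python sketch computes the MPI-to-SAS ratio of lines of code for each application. The figures are copied from the table above; the variable names are only illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
# Lines of code per application, copied from the table above.
apps = ["FFT", "OCEAN", "LU", "RADIX", "SAMPLE", "N-BODY"]
mpi = [222, 4320, 470, 384, 479, 1371]
sas = [210, 2878, 309, 201, 450, 950]

# Ratio of MPI lines to SAS lines, rounded to two decimals.
ratios = {app: round(m / s, 2) for app, m, s in zip(apps, mpi, sas)}

for app, ratio in ratios.items():
    print(app, ratio)

# Application with the largest relative difference.
largest = max(ratios, key=ratios.get)
print("largest ratio:", largest)
```
&lt;br /&gt;
The ratio stays close to 1 for FFT and SAMPLE and is largest for RADIX, where the MPI version is roughly 1.9 times as long as the SAS version.&lt;br /&gt;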
&lt;br /&gt;
In terms of performance, the results indicated that SAS was only about half as efficient as MP at exploiting parallelism for most of the applications in this study. The only application that showed similar performance for both models (MP and SAS) was LU. The authors concluded that the overhead of the SVM protocol for maintaining page coherence and synchronization was the handicap of the easier SAS programming model.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2004, [http://dl.acm.org/citation.cfm?id=1006252 Iosevich and Schuster] performed a study on the choice of memory consistency model in a DSM. The two models under study were the '''sequential consistency''' (SC) model and a relaxed consistency model, in particular the '''home-based lazy release consistency''' (HLRC) protocol. SC provides a less complex programming model, whereas HLRC improves run-time performance because it allows parallel memory access. Memory consistency models provide a specific set of rules for interfacing with memory.&lt;br /&gt;
&lt;br /&gt;
The authors used a [http://www.cs.uga.edu/~dkl/6730/Fall02/Readings/millipage.pdf multiview] technique to ensure an efficient implementation of SC with fine-grain access to memory. The main advantage of this technique is that by mapping one physical region to several virtual regions, the system avoids fetching the whole content of the physical memory when there is a fault accessing a specific variable located in one of the virtual regions. One step further is the mechanism proposed by [http://onlinelibrary.wiley.com/doi/10.1002/spe.417/abstract Niv and Schuster] to change the granularity dynamically at run time.&lt;br /&gt;
For SC with multiview (MV), only the page replicas need to be tracked, whereas for HLRC the tracking needed is much more complex. The advantage of HLRC is that write faults on read-only pages are local, so these operations have a lower cost.&lt;br /&gt;
&lt;br /&gt;
This table summarizes the characteristics of the benchmark applications used for the measurements. In the Synch column, B represents barriers and L represents locks.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Application&lt;br /&gt;
! Input data set&lt;br /&gt;
! Shared memory&lt;br /&gt;
!Sharing granularity&lt;br /&gt;
! Synch&lt;br /&gt;
!Allocation pattern&lt;br /&gt;
|-&lt;br /&gt;
| Water-nsq|| 8000 molecules|| 5.35MB|| a molecule (672B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Water-sp|| 8000 molecules|| 10.15MB|| a molecule (680B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| LU|| 3072 × 3072|| 72.10MB|| block (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| FFT|| 2^20 numbers|| 48.25MB|| a row segment|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| TSP|| A graph of 32 cities|| 27.86MB|| a tour (276B)|| L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| SOR|| 2066 × 10240|| 80.73MB|| a row (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Barnes-sp|| 32768 bodies|| 41.21MB|| body fields (4-32B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Radix|| 10240000 keys|| 82.73MB|| an integer (4B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Volrend|| a file -head.den-|| 29.34MB|| a 4 × 4 box (4B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Ocean|| a 514 × 514 grid|| 94.75MB|| grid point (8B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| NBody|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|-&lt;br /&gt;
| NBodyW|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The authors found that the average speedup of the HLRC protocol is 5.97, and the average speedup of the SC(with MV) protocol is 4.5 if non-optimized allocation is used, and 6.2 if the granularity is changed dynamically. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2008, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.1319 Roy and Chaudhary] compared the communication requirements of three different page-based DSM systems ([http://www.cs.umd.edu/projects/cvm/ CVM], [http://www.cs.utah.edu/flux/quarks.html Quarks], and [http://ieeexplore.ieee.org/iel4/5737/15339/00709960.pdf?arnumber=709960 Strings]) that use virtual memory mechanisms to trap accesses to shared areas. Their study was also based on the SPLASH-2 suite of programs.&lt;br /&gt;
In their experiments, Quarks was running tasks on separate processes (it does not support more than one application thread per process) while CVM and Strings were run with multiple application threads.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Program &lt;br /&gt;
!CVM &lt;br /&gt;
!Quarks &lt;br /&gt;
!Strings&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| FFT||1290||2419||1894&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-c||135||-||485&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-n||385||2873||407&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| OCEAN-c||1955||15475||6676&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-n2||2253||38438||10032&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-sp||905||7568||1998&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MATMULT||290||1307||645&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SOR||247||7236||934&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The results indicate that Quarks generates a large quantity of messages in comparison with the other two systems, as the table above shows. LU-c (the contiguous version of LU) does not have a result for Quarks, because it was not possible to obtain results for the number of tasks (16) used in the experiment, owing to the excessive number of collisions. In general, because of the lock-related traffic, the performance of Quarks is quite low compared to the other two systems for many of the application programs tested.&lt;br /&gt;
CVM improves on lock management by allowing out-of-order access through a centralized lock manager. This makes the locking times for CVM much smaller than for the other systems. &lt;br /&gt;
Nevertheless, the best performer is Strings. It wins in overall performance over the two other systems compared, but there is room for improvement here too, as locking times in Strings were observed to account for a high percentage of the overall computation. &lt;br /&gt;
&lt;br /&gt;
The overall conclusion is that using multiple threads per node and optimized locking mechanisms (CVM-type) provides the best performance.&lt;br /&gt;
&lt;br /&gt;
=== Evolution ===&lt;br /&gt;
A more recent version of a distributed shared memory system is vNUMA. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html Chapman and Heiser] describe vNUMA (where v is for virtual and NUMA is for [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non Uniform Memory Access]) as &amp;quot;a virtual machine that presents a cluster as a virtual shared-memory multiprocessor.&amp;quot; The virtualization in vNUMA is a layer between the real CPUs that form the distributed shared memory system and the OS that runs on top of it, usually Linux. The DSM in vNUMA is part of the hypervisor, which is the part of a virtual system that maps guest virtual addresses into real physical ones. The guest virtual addresses are then mapped into virtual memory addresses through the guest OS. &lt;br /&gt;
&lt;br /&gt;
The difference from other virtual machines is that vNUMA runs one OS on top of several physical machines, whereas virtual machines often run several guest OSes on top of one host OS that runs on one machine. It also uses software DSM techniques to present all the virtualized memory as a whole.&lt;br /&gt;
&lt;br /&gt;
The DSM in vNUMA is a single-writer/multiple-reader write-invalidate protocol with sequential consistency. It is based on the IVY DSM, but it introduces several improvements to increase performance. The owner of a page can determine whether the page needs to be sent by looking at the copyset (the set of nodes that maintain a copy of the page), which avoids several page faults. In addition, the manager becomes the owner of the copyset as soon as it is part of it. There are two further improvements: ''incremental deterministic merging'' and ''write-update-plus (WU+)''. ''Incremental deterministic merging'' uses sequence numbers to ensure that a location gets updated with the latest value and not with intermediate, out-of-order writes. ''Write-update-plus (WU+)'' enforces single-writer for pages where atomic operations are done. vNUMA dynamically changes from multiple-writer to single-writer when atomic operations are detected. &lt;br /&gt;
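&lt;br /&gt;
The role of sequence numbers can be illustrated with a simplified Python sketch. This is not vNUMA's actual implementation: the function name, the per-location bookkeeping, and the update values are assumptions made only to show how a stale, out-of-order write can be discarded.&lt;br /&gt;
&lt;br /&gt;
```python
def apply_write(memory, last_seq, location, value, seq):
    """Apply a write only when its sequence number is newer than the last one applied."""
    if seq > last_seq.get(location, -1):
        memory[location] = value
        last_seq[location] = seq
        return True
    return False  # stale, out-of-order write: ignored


memory, last_seq = {}, {}

# Updates for the same location arrive out of order: (value, sequence number).
for value, seq in [(10, 1), (30, 3), (20, 2)]:
    apply_write(memory, last_seq, "x", value, seq)

print(memory["x"])  # 30
```
&lt;br /&gt;
The late-arriving write with sequence number 2 is dropped, so the location keeps the value of the most recent write.&lt;br /&gt;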
&lt;br /&gt;
In [http://communities.vmware.com/community/vmtn/cto/high-performance/blog/2011/09/19/vnuma-what-it-is-and-why-it-matters vNUMA what it is and why it matters], VMware presents vNUMA as part of vSphere, a virtualization platform oriented to build cloud computing frameworks.&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
*[http://www.sgi.com/pdfs/4250.pdf Performance and Productivity Breakthroughs with Very Large Coherent Shared Memory: The SGI® UV Architecture] SGI white paper.&lt;br /&gt;
*[http://www.scalemp.com/architecture#2 Versatile SMP (vSMP) Architecture]&lt;br /&gt;
*Adrian Moga, Michel Dubois, [http://www.sciencedirect.com/science/article/pii/S1383762108001136 &amp;quot;A comparative evaluation of hybrid distributed shared-memory systems,&amp;quot;] Journal of Systems Architecture, Volume 55, Issue 1, January 2009, Pages 43-52&lt;br /&gt;
*Jinbing Peng; Xiang Long; Limin Xiao; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=4708970&amp;amp;isnumber=4708921 &amp;quot;DVMM: A Distributed VMM for Supporting Single System Image on Clusters,&amp;quot;] Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for , vol., no., pp.183-188, 18-21 Nov. 2008&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
*Shan, H.; Singh, J.P.; Oliker, L.; Biswas, R.; , [http://escholarship.org/uc/item/76p9b40g#page-1 &amp;quot;Message passing vs. shared address space on a cluster of SMPs,&amp;quot;] Parallel and Distributed Processing Symposium., Proceedings 15th International , vol., no., pp.8 pp., Apr 2001&lt;br /&gt;
*Protic, J.; Tomasevic, M.; Milutinovic, V.; , [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 &amp;quot;Distributed shared memory: concepts and systems,&amp;quot;] Parallel &amp;amp; Distributed Technology: Systems &amp;amp; Applications, IEEE , vol.4, no.2, pp.63-71, Summer 1996&lt;br /&gt;
*Chandola, V. , [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.5206&amp;amp;rep=rep1&amp;amp;type=pdf &amp;quot;Design Issues in Implementation of Distributed Shared Memory in User Space,&amp;quot;]&lt;br /&gt;
*Nitzberg, B.; Lo, V. , [http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf &amp;quot;Distributed Shared Memory:  A Survey of Issues and Algorithms&amp;quot;]&lt;br /&gt;
*Steven Cameron Woo , Moriyoshi Ohara , Evan Torrie , Jaswinder Pal Singh , Anoop Gupta, [http://dl.acm.org/citation.cfm?id=223990 &amp;quot;The SPLASH-2 programs: characterization and methodological considerations,&amp;quot;] Proceedings of the 22nd annual international symposium on Computer architecture, p.24-36, June 22-24, 1995, S. Margherita Ligure, Italy&lt;br /&gt;
*Jegou, Y. , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Implementation of page management in Mome, a user-level DSM] Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on, p.479-486, 21 May 2003, IRISA/INRIA, France&lt;br /&gt;
*Hennessy, J.; Heinrich, M.; Gupta, A.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=747863 &amp;quot;Cache-Coherent Distributed Shared Memory:  Perspectives on Its Development and Future Challenges,&amp;quot;] Proceedings of the IEEE, Volume: 87 Issue:3, pp.418 - 429, Mar 1999, Comput. Syst. Lab., Stanford Univ., CA&lt;br /&gt;
*J. Protic, M. Tomasevic, and V. Milutinovic, [http://media.wiley.com/product_data/excerpt/76/08186773/0818677376-2.pdf An Overview of Distributed Shared Memory]&lt;br /&gt;
*Yoon, M.; Malek, M.; , [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf &amp;quot;Configurable Shared Virtual Memory for Parallel Computing&amp;quot;] University of Texas Technical Report tr94-21, July 15 1994, Department of Electrical and Computer Engineering, The University of Texas at Austin&lt;br /&gt;
*Dubnicki, C.;   Iftode, L.;   Felten, E.W.;   Kai Li;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=508084 &amp;quot;Software Support for Virtual Memory-Mapped Communication&amp;quot;] Parallel Processing Symposium, 1996., Proceedings of IPPS '96, The 10th International, pp.372 - 381, 15-19 Apr 1996, Dept. of Comput. Sci., Princeton Univ., NJ &lt;br /&gt;
*Dubnicki, C.;   Bilas, A.;   Li, K.;   Philbin, J.;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=580931 &amp;quot;Design and Implementation of Virtual Memory-Mapped Communication on Myrinet&amp;quot;] Parallel Processing Symposium, 1997. Proceedings., 11th International, pp.388 - 396, 1-5 Apr 1997, Princeton Univ., NJ&lt;br /&gt;
*Kranz, D.; Johnson, K.; Agarwal, A.; Kubiatowicz, J.; Lim, B.; , [http://delivery.acm.org.prox.lib.ncsu.edu/10.1145/160000/155338/p54-kranz.pdf?ip=152.1.24.251&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=81831968&amp;amp;CFTOKEN=62928147&amp;amp;__acm__=1327853249_029db7e958cb50bd47056f939f3296f7 &amp;quot;Integrating Message-Passing and Shared-Memory:  Early Experience&amp;quot;] PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp.54 - 63&lt;br /&gt;
*Amza, C.;  Cox, A.L.;  Dwarkadas, S.;  Keleher, P.;  Honghui Lu;  Rajamony, R.;  Weimin Yu;  Zwaenepoel, W.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 &amp;quot;TreadMarks: shared memory computing on networks of workstations&amp;quot;] Computer, Volume: 29 Issue:2, pp. 18 - 28, Feb 1996, Dept. of Comput. Sci., Rice Univ., Houston, TX&lt;br /&gt;
*David R. Cheriton. 1985. [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems.] SIGOPS Oper. Syst. Rev. 19, 4 (October 1985), 26-33.&lt;br /&gt;
*Matthew Chapman and Gernot Heiser. 2009. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html vNUMA: a virtual shared-memory multiprocessor.] In Proceedings of the 2009 conference on USENIX Annual technical conference (USENIX'09). USENIX Association, Berkeley, CA, USA, 2-2.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
The memory hierarchy described for the CSVS system places remote memories:&lt;br /&gt;
# Between main memory and local disk storage&lt;br /&gt;
# Same hierarchy as local disk storage&lt;br /&gt;
# Below local disk storage&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When messages are sent by the OS to retrieve non-local data the virtual address of the retrieved data is translated to physical:&lt;br /&gt;
# At the origin of the message, i.e. where the page fault occurs&lt;br /&gt;
# By the DSM system default manager&lt;br /&gt;
# At the location where the desired page resides&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
DSM nodes&lt;br /&gt;
# partially map variable amounts of their memory to the distributed address space&lt;br /&gt;
# are configured to supply a contiguous and fixed amount of memory to the distributed address space&lt;br /&gt;
# utilize I/O to access the entirely non-local distributed address space&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SAS programming model:&lt;br /&gt;
# Has evolved beyond MP as it is difficult to program in scalable DSM environments&lt;br /&gt;
# Utilizes MP to communicate but relies on the ease of a common address space&lt;br /&gt;
# Has suffered too many security problems, scalable MP now dominates the landscape&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Page management in MOME:&lt;br /&gt;
# Requires consistent address space mapping across all nodes&lt;br /&gt;
# Is managed from a global DSM perspective&lt;br /&gt;
# Allows an F and V page descriptor to occur for the same page on the same node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The most adopted DSM algorithm is:&lt;br /&gt;
# Single Reader/ Single Writer (SRSW)&lt;br /&gt;
# Multiple Readers/ Single Writer (MRSW)&lt;br /&gt;
# Multiple Readers/Multiple Writers (MRMW)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Sequential Consistency:&lt;br /&gt;
# all processors see the same ordering of memory references, and these are issued in sequences by the individual processors&lt;br /&gt;
# the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence&lt;br /&gt;
# consistency is required only on synchronization accesses&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
SPLASH is a&lt;br /&gt;
# coherence protocol&lt;br /&gt;
# collection of parallel programs engineered for the evaluation of shared address space machines&lt;br /&gt;
# DSM implementation&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The complexity of development is &lt;br /&gt;
# the same for MP and SAS&lt;br /&gt;
# lower for SAS&lt;br /&gt;
# lower for MP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
vNUMA is a&lt;br /&gt;
# Fast Fourier Transform implementation&lt;br /&gt;
# network implementation&lt;br /&gt;
# virtual machine that presents a cluster as a virtual shared-memory multiprocessor&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=72418</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=72418"/>
		<updated>2013-02-10T20:43:19Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring_2012/2a va ]]&lt;br /&gt;
*Chapter 2b [[CSC/ECE 506 Spring 2012/ch2b cm | CSC/ECE 506 Spring 2012/ch2b cm]]&lt;br /&gt;
*Chapter 2b [[ECE506_CSC/ECE_506_Spring_2012/2b_az | CSC/ECE 506 Spring 2012/2b az - Data-Parallel Processing with the AMD HD 6900 Series Graphics Processing Unit]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;br /&gt;
*Chapter 4a[[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]&lt;br /&gt;
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms  ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]&lt;br /&gt;
*Chapter 4b [[Chapter 4b CSC/ECE 506 Spring 2011 / ch4b]]&lt;br /&gt;
*Chapter 5a [[ CSC/ECE 506 Spring 2012/ch5a ja | CSC/ECE 506 Spring 2012/ch5a ja ]]&lt;br /&gt;
*Chapter 9a [[CSC/ECE 506 Spring 2012/ch9a cm | CSC/ECE 506 Spring 2012/ch9a cm]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]&lt;br /&gt;
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]&lt;br /&gt;
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]&lt;br /&gt;
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]&lt;br /&gt;
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]&lt;br /&gt;
*Chapter 10 [[CSC/ECE 506 Spring 2012/ch10 sj | CSC/ECE 506 Spring 2012/ch10 sj]]&lt;br /&gt;
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]&lt;br /&gt;
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]&lt;br /&gt;
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 (Ready for Final Review) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]&lt;br /&gt;
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 | Interconnection Network Topologies]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a ry]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c dm]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1c cl]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/1a mw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3a yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/7b yw]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/3b sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/4b rs]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/6b am]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/8a cj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a dr]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10a jp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/9a ms]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/10b sr]]&lt;br /&gt;
*Chapter 11a [[ECE506_CSC/ECE_506_Spring_2012/11a_az | CSC/ECE 506 Spring 2012/11a az - Performance of DSM system]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/12b jh]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a fu]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2010/8a sk]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2012/11a ht]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1b dj]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1a sp]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1d ks]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/2b so]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/1c ad]]&lt;br /&gt;
*[[CSC/ECE 506 Spring 2013/3b xz]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_aj]]&lt;br /&gt;
*[[CSC/ECE_506_Spring_2013/4a_ss]]&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/2a_ss&amp;diff=72417</id>
		<title>CSC/ECE 506 Spring 2013/2a ss</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/2a_ss&amp;diff=72417"/>
		<updated>2013-02-10T20:40:42Z</updated>

		<summary type="html">&lt;p&gt;Shvemuri: Created page with &amp;quot;SCD's IBM SP system blackforest, a distributed shared memory ('''DSM''') system  == SAS programming on distributed-memory machines == [http://en...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Dsm.jpg|300px|thumb|left|SCD's IBM SP system blackforest, a distributed shared memory ('''DSM''') system]]&lt;br /&gt;
&lt;br /&gt;
== SAS programming on distributed-memory machines ==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Shared_memory '''Shared Address Space'''] (SAS) programming on distributed memory machines is a programming abstraction that requires less development effort than the traditional method of [http://en.wikipedia.org/wiki/Message_passing '''Message Passing'''] (MP) on distributed memory machines, such as clusters of servers.  Distributed systems are groups of computers that communicate through a network and share a common work goal.  Distributed systems typically do not physically share the same memory (are not [http://en.wikipedia.org/wiki/Coupling_%28computer_programming%29 '''tightly coupled''']); instead, each processor or group of processors must depend on mechanisms other than direct memory access in order to communicate.  Relevant issues include [http://en.wikipedia.org/wiki/Memory_coherence '''memory coherence'''], types of memory access, data and process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''], and performance.&lt;br /&gt;
&lt;br /&gt;
=== Origins ===&lt;br /&gt;
Distributed memory systems started to flourish in the 1980s. Increasing processor performance and network connectivity offered the perfect environment for parallel processing over a network of computers, which was a cheap way to assemble massive computing power. The main drawback was the move from sequential programs written for local memory to parallel programs that share memory. This is where SAS simplified programming by hiding the mechanisms used to access distant memory located in other computers of the cluster.&lt;br /&gt;
&lt;br /&gt;
In 1985, Cheriton, in his article [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 &amp;quot;Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems&amp;quot;], introduced ideas for applying shared memory techniques in distributed memory systems. Cheriton envisioned a system of nodes with a pool of shared memory and a common file namespace that could &amp;quot;decentralize the implementation of a service.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:Dsmd.jpg|400px|thumb|right|Distributed Shared Memory]]&lt;br /&gt;
Early distributed computer systems relied almost exclusively on message passing in order to communicate with one another, and this technique is still widely used today.  In a message passing model, each processor's local memory can be considered as isolated from that of the rest of the system.  Processes or objects can send or receive messages in order to communicate and this can occur in a synchronous or asynchronous manner.  In distributed systems, and particularly with certain types of programs, the message passing model can become overly burdensome to the programmer, as tracking data movement and maintaining data integrity can become quite challenging with many control threads.  A shared address or shared-memory system, however, can provide a programming model that simplifies data sharing via uniform mechanisms of data structure reads and writes on common memory. Relevant design elements of early SAS implementations included scalability, coherence, structure and granularity.  Most early examples did not structure memory; that is, the layout of shared memory was simply a linear array of words.  Some, however, structured data as objects or language types.  '''IVY''', an early example of a DSM system, implemented shared memory as virtual memory.  The granularity, or unit share size, for IVY was 1-Kbyte pages and the memory was unstructured.  Current distributed systems seek to take advantage of both SAS and MP programming model principles in hybrid systems. &lt;br /&gt;
&lt;br /&gt;
=== Distributed Shared Memory (DSM) ===&lt;br /&gt;
Generally a distributed system consists of a set of nodes connected by an interconnection network.  Nodes may be comprised of individual processors or a multiprocessor system (e.g. [http://en.wikipedia.org/wiki/Symmetric_multiprocessing '''Symmetric Multiprocessor'''] (SMP)), the latter typically sharing a system bus.  Each node contains a local memory, which maps partially to the distributed address space. A specific interconnection controller in each node must connect it to the system.  &lt;br /&gt;
A problem when considering optimal page size is the balance between a process likely needing quick access to a large range of the shared address space, which argues for a larger page size, countered by the greater contention for individual pages that the larger page may cause amongst processes and the [http://en.wikipedia.org/wiki/False_sharing '''false sharing'''] it may lead to.  Memory coherence is another important design element consideration and semantics can be instituted that run gradations of strict to weak consistencies.  The strictest consistency guarantees that a read returns the most recently written value.  Weaker consistencies may use synchronization operations to guarantee sequential consistency.&lt;br /&gt;
&lt;br /&gt;
=== Cache-Coherent DSM ===&lt;br /&gt;
&lt;br /&gt;
Early DSM systems implemented a shared address space where the amount of time required to access a piece of data was &lt;br /&gt;
related to its location.  These systems became known as [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access '''Non-Uniform Memory Access'''] (NUMA), whereas an SMP type&lt;br /&gt;
system is known as [http://en.wikipedia.org/wiki/Uniform_Memory_Access '''Uniform Memory Access'''] (UMA) architecture.  NUMA architectures were difficult to program in due &lt;br /&gt;
to potentially significant differences in data access times. SMP architectures dealt with this problem through caching. &lt;br /&gt;
Protocols were established that ensured prior to writing a location, all other copies of the location (in other caches)&lt;br /&gt;
were invalidated.  These protocols did not scale to DSM machines and different approaches were necessary.&lt;br /&gt;
&lt;br /&gt;
Cache-coherent DSM architectures rely on directory-based [http://en.wikipedia.org/wiki/Cache_coherency '''cache coherence'''], in which an extra directory structure keeps track&lt;br /&gt;
of all blocks that have been cached by each processor.  A coherence protocol can then establish a consistent view of &lt;br /&gt;
memory by maintaining state and other information about each cached block.  At a minimum, these states usually include Invalid,&lt;br /&gt;
Shared, and Exclusive.  Furthermore, in a cache-coherent DSM machine, the directory is distributed across the nodes so that each&lt;br /&gt;
entry is kept with the block it describes in that node's physical local memory.&lt;br /&gt;
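&lt;br /&gt;
As a rough illustration of the directory idea described above, the following Python sketch keeps one directory entry per block with the three minimal states.  The class and method names (''Directory'', ''read'', ''write'') are hypothetical and chosen only for this example; a real protocol would add transient states, acknowledgements and a directory distributed across home nodes.&lt;br /&gt;

```python
# Toy directory for a cache-coherent DSM; all names are illustrative assumptions.
INVALID, SHARED, EXCLUSIVE = "Invalid", "Shared", "Exclusive"


class Directory:
    """Tracks, per block, the coherence state and the set of caching nodes."""

    def __init__(self):
        self.state = {}    # block id to state (a missing entry means Invalid)
        self.sharers = {}  # block id to the set of node ids holding a copy

    def read(self, node, block):
        # A read adds the reader as a sharer and leaves the block Shared.
        holders = self.sharers.setdefault(block, set())
        holders.add(node)
        self.state[block] = SHARED
        return sorted(holders)

    def write(self, node, block):
        # A write invalidates every other copy before granting exclusivity.
        holders = self.sharers.setdefault(block, set())
        invalidated = sorted(holders - {node})
        self.sharers[block] = {node}
        self.state[block] = EXCLUSIVE
        return invalidated


directory = Directory()
directory.read(0, "B")
directory.read(1, "B")
victims = directory.write(2, "B")
print(victims, directory.state["B"])
```

In this toy run, nodes 0 and 1 read block B and node 2 then writes it, so the directory reports nodes 0 and 1 as the copies to invalidate and marks the block Exclusive.&lt;br /&gt;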
&lt;br /&gt;
=== Page Management and memory mapping in Mome ===&lt;br /&gt;
[[File:Untitled_Project.jpg|350px|thumb|left|Memory Mapping in Mome]]&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Mome] is described by its developers as a user-level distributed shared memory.  Mome, in 2003, was a run-time model that mapped Mome segments onto node private address space.  &lt;br /&gt;
&lt;br /&gt;
===== Mome Segment creation =====&lt;br /&gt;
&lt;br /&gt;
Segment creation was initiated through a ''MomeCreateSegment(size)'' call, which returned an identifier for mapping used by all nodes.  Any process can request a mapping of a section of its local memory to a Mome segment section by calling ''MomeMap(Addr, Lg, Prot, Flags, Seg, Offset)'', which returns the starting address of the mapped region.  Each mapping request made by a process is independent and the addresses of the mappings may or may not be consistent on all nodes.  If mappings are consistent between processes, however, then pointers may be shared by them.  Mome supports strong and weak consistency models, and for any particular page each node is able to dynamically manage its consistency during program execution.&lt;br /&gt;
&lt;br /&gt;
===== Page Management in Mome =====&lt;br /&gt;
&lt;br /&gt;
Mome manages [http://en.wikipedia.org/wiki/Page_%28computer_memory%29 '''pages'''] in a directory based scheme where each page directory maintains the status of six characteristics per page on each node.  The page manager acts upon collections of nodes according to these characteristics for each page:  &lt;br /&gt;
V nodes possess the current version, M nodes have a modified version, S nodes want strong consistency, I nodes are &lt;br /&gt;
invalidated, F nodes have initiated a modification merge and H nodes are a special type of hidden page.  A new version of &lt;br /&gt;
a page is created prior to a constraint violation and before modifications are integrated as a result of a consistency&lt;br /&gt;
request.  &lt;br /&gt;
&lt;br /&gt;
===== Memory mapping in Mome =====&lt;br /&gt;
&lt;br /&gt;
The Mome memory mapping figure to the left shows a possible DSM memory organization on a single node.  The DSM memory size&lt;br /&gt;
shown is 22 pages.  When a new segment is created on a node a segment descriptor is created on that node.  In this case the&lt;br /&gt;
segment descriptor is 12 pages, with each segment descriptor block corresponding to one page.  Each block also contains&lt;br /&gt;
three DSM memory references for current, modified and next version of pages.  The memory organization state shows an &lt;br /&gt;
application with two mappings, M1 and M2, with segment offsets at 0 and 8.  The six pages of M1 are managed by segment &lt;br /&gt;
descriptor blocks 0 to 5.  The descriptor blocks (and application memory) show that pages 1, 2 and 5 have no associated &lt;br /&gt;
memory, while M1 page 0 is mapped to block 6 as a current version and M1 page 3 is mapped to block 13 as a current version,&lt;br /&gt;
block 8 as a modified version, and has initiated a modifications merge as indicated by the block 17 pointer.  The communication&lt;br /&gt;
layer manages incoming messages from other nodes. &lt;br /&gt;
&lt;br /&gt;
[[File:Mem hierarchy.png|200px|thumb|right|Memory hierarchy of node]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Node Communication ===&lt;br /&gt;
As described by [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf Yoon, et al.] in 1994 the communication paradigm &lt;br /&gt;
for their DSM node network relies on a memory hierarchy for each node that places remote memories at the same level of the hierarchy as &lt;br /&gt;
its own local disk storage.  [http://en.wikipedia.org/wiki/Page_fault '''Page faults'''] within a given node that can be resolved within disk storage are handled &lt;br /&gt;
normally while those that cannot are resolved between node main memory and memory of other nodes. Point to point &lt;br /&gt;
communication at the node level is supported through message passing, and the specific mechanism for communication is&lt;br /&gt;
agreed to by all nodes. &lt;br /&gt;
&lt;br /&gt;
Yoon describes a DSM system that generates a shared virtual memory on a per job basis.  A '''configurable shared virtual address space'''&lt;br /&gt;
(CSVS) is readied when a member node receives a job, generates a job identification number and creates&lt;br /&gt;
an information table in its memory:&lt;br /&gt;
&lt;br /&gt;
                                     ''JOB_INFORMATION {''&lt;br /&gt;
                                         ''status;''&lt;br /&gt;
                                         ''number_of_tasks;''&lt;br /&gt;
                                         ''number_of_completed_tasks;''&lt;br /&gt;
                                         ''*member_list;''                    /*pointer to first member*/&lt;br /&gt;
                                         ''number_of_members;''&lt;br /&gt;
                                         ''IO_server;''&lt;br /&gt;
                                     ''}''&lt;br /&gt;
&lt;br /&gt;
The ''status'' refers to the creation of the CSVS and ''number_of_members'' and ''member_list'' are established through&lt;br /&gt;
a task distribution process during address space assignment.  All tasks associated with the program are tagged with&lt;br /&gt;
the ''job_id'' and ''requester_id'' and, following address space assignment, are distributed across the system.  The&lt;br /&gt;
actual CSVS creation occurs when the first task of a job is initiated by a member, which asks all other members to generate the new&lt;br /&gt;
CSVS.  Subspace assignment for the SAS model ensues under the specific ''job_id''.&lt;br /&gt;
&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Operating_system '''operating system'''] (OS) or [http://en.wikipedia.org/wiki/Memory_management_unit '''memory management unit'''] (MMU) of each member maintains a copy of the ''JOB_INFORMATION'' &lt;br /&gt;
table which is consulted to identify the default manager when a page fault occurs.  When a page fault does occur, the MMU&lt;br /&gt;
locates the default manager and handles the fault normally.  If the page requested is out of its subspace then the &lt;br /&gt;
virtual address, ''job_id'', and default manager identification are sent to the [http://en.wikipedia.org/wiki/Control_unit '''control unit'''] (CU) to construct a &lt;br /&gt;
message requesting a page copy.  All messages sent through the CSVS must include a virtual address and the ''job_id'',&lt;br /&gt;
which acts as protection to control access to relevant memory locations.  When received at the appropriate member&lt;br /&gt;
node, the virtual address is translated to a local physical address.&lt;br /&gt;
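&lt;br /&gt;
To make the fault path described above more concrete, here is a small Python sketch of the ''JOB_INFORMATION'' table and of the page-copy request message.  The field types, the ''page_request'' helper and the round-robin assignment of pages to members are assumptions made for illustration only; the source does not specify how subspaces map to members.&lt;br /&gt;

```python
from dataclasses import dataclass, field


@dataclass
class JobInformation:
    """Python rendering of the JOB_INFORMATION table; field types are assumed."""
    status: str
    number_of_tasks: int
    number_of_completed_tasks: int = 0
    member_list: list = field(default_factory=list)
    io_server: int = 0

    @property
    def number_of_members(self):
        return len(self.member_list)


def page_request(job, job_id, virtual_address, page_size):
    """Build the message sent when a fault falls outside the local subspace.

    Assumption: pages are assigned to members round-robin in list order.
    """
    page = virtual_address // page_size
    manager = job.member_list[page % job.number_of_members]
    # Every CSVS message carries the virtual address and the job_id.
    return {"job_id": job_id, "virtual_address": virtual_address, "manager": manager}


job = JobInformation("created", 8, member_list=[10, 11, 12])
msg = page_request(job, 5, 8192, 4096)
print(msg)
```

With 4-Kbyte pages, address 8192 falls in page 2, which this hypothetical mapping assigns to the third member (node 12).&lt;br /&gt;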
[[File:Jacobi_code.jpg|300px|thumb|left|Jacobi method pseudocode using TreadMarks API]]&lt;br /&gt;
&lt;br /&gt;
=====Improvements in communication=====&lt;br /&gt;
Early SAS programming models in DSM environments suffered from poor performance because protection schemes required&lt;br /&gt;
applications to access the network via system calls, significantly increasing latency.  Later software&lt;br /&gt;
systems and network interfaces arose that were able to ensure safety without incurring the time cost of the system calls.  Addressing this and other &lt;br /&gt;
latency sources on both ends of communication was an important goal for projects such as the&lt;br /&gt;
'''Virtual memory-mapped communication''' (VMMC) model that was developed as part of the [http://shrimp.cs.princeton.edu/index.html Shrimp Project]. &lt;br /&gt;
&lt;br /&gt;
Protection is achieved in VMMC because the receiver must grant permission before the sender is allowed to transfer data&lt;br /&gt;
to a receiver defined area of its address space.  In this communication scheme, the receiver process exports areas of its&lt;br /&gt;
address space that will act as receive buffers and sending processes must import the destinations.  There is no explicit &lt;br /&gt;
receive operation in VMMC.  Receivers are able&lt;br /&gt;
to define which senders can import specific buffers and VMMC ensures only receiver buffer space is overwritten.  Imported&lt;br /&gt;
receive buffers are mapped to a destination proxy space which can be implemented as part of the sender's virtual address&lt;br /&gt;
space and can be translated by VMMC to a receiver, process and memory address.  VMMC supports a deliberate update&lt;br /&gt;
request and will update data sent previously to an imported receive buffer.  This transfer occurs directly without receiver&lt;br /&gt;
CPU interruption.&lt;br /&gt;
&lt;br /&gt;
[[File:Shortest_path_pseudocode.jpg|300px|thumb|right|Shortest path pseudocode using TreadMarks API]]&lt;br /&gt;
=== Programming Environment ===&lt;br /&gt;
The globally shared memory abstraction provided through virtual memory or some other DSM mechanism allows programmers &lt;br /&gt;
to focus on algorithms instead of processor communication and data tracking.  Many programming environments have been&lt;br /&gt;
developed for DSM systems including Rice University's [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]&lt;br /&gt;
in the 1990s.  TreadMarks was a user-level library that ran on top of Unix.  Programs were written in&lt;br /&gt;
C, C++ or Fortran and then compiled and linked with the TreadMarks library.  &lt;br /&gt;
&lt;br /&gt;
Shown at left is a pseudocode example of using the TreadMarks API to implement the Jacobi method, a type of partial &lt;br /&gt;
differential equation solver.  The code iterates over a 2D array and updates each element to the average of its four&lt;br /&gt;
nearest neighbors.  All processors are assigned an approximately equivalent number of rows and neighboring processes &lt;br /&gt;
share boundary rows as is necessary for the calculation.  This example shows TreadMarks use of [http://en.wikipedia.org/wiki/Barrier_%28computer_science%29 '''barriers'''], a technique used for process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''].  Barriers prevent race&lt;br /&gt;
conditions.  ''void Tmk_startup(int argc, char **argv)'' initializes TreadMarks and starts the remote processes.  &lt;br /&gt;
The ''void Tmk_barrier(unsigned id)'' call blocks the calling process until every other process arrives at the barrier.  In this&lt;br /&gt;
example, ''Tmk_barrier(0)'' guarantees that process 0 completes initialization before any process proceeds, ''Tmk_barrier(1)'' &lt;br /&gt;
guarantees all previous iteration values are read before any current iteration values are written, and ''Tmk_barrier(2)''&lt;br /&gt;
guarantees all current iteration values are written before any next iteration computation begins.&lt;br /&gt;
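&lt;br /&gt;
The three-barrier pattern described above can be sketched in Python.  This is not TreadMarks code: Python threads and ''threading.Barrier'' stand in for TreadMarks processes and ''Tmk_barrier'', and the grid size, worker count and iteration count are arbitrary assumptions.  Python threads already share memory, so the sketch illustrates only the synchronization structure.&lt;br /&gt;

```python
import threading

# Sketch only: the sizes below are assumptions chosen to keep the example small.
N, WORKERS, ITERATIONS = 6, 2, 3
grid = [[0.0] * N for _ in range(N)]
scratch = [[0.0] * N for _ in range(N)]
barrier = threading.Barrier(WORKERS)


def worker(pid):
    if pid == 0:
        # Worker 0 initializes the shared grid with a hot top edge.
        for j in range(N):
            grid[0][j] = 100.0
    barrier.wait()  # plays the role of Tmk_barrier(0)

    # Each worker owns a contiguous band of interior rows.
    rows_per_worker = (N - 2) // WORKERS
    start = 1 + pid * rows_per_worker
    end = start + rows_per_worker
    for _ in range(ITERATIONS):
        for i in range(start, end):
            for j in range(1, N - 1):
                scratch[i][j] = (grid[i - 1][j] + grid[i + 1][j]
                                 + grid[i][j - 1] + grid[i][j + 1]) / 4.0
        barrier.wait()  # all old values are read before any are overwritten
        for i in range(start, end):
            for j in range(1, N - 1):
                grid[i][j] = scratch[i][j]
        barrier.wait()  # all new values are written before the next iteration


threads = [threading.Thread(target=worker, args=(p,)) for p in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(grid[1])
```

Because every worker waits at both barriers in each iteration, the threaded result matches what a sequential Jacobi sweep over the same grid would produce.&lt;br /&gt;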
&lt;br /&gt;
To the right is shown a short pseudocode program exemplifying another SAS synchronization technique which uses [http://en.wikipedia.org/wiki/Lock_%28computer_science%29 '''locks'''].  This program calculates the shortest path in a grouping of nodes that starts at any designated start node, visits each&lt;br /&gt;
other node once and returns to the origin node.  The shortest route identified thus far is stored in the shared ''Shortest_length''&lt;br /&gt;
and investigated routes are kept in a queue, most promising at the front, and expanded one node at a time.  A process&lt;br /&gt;
compares its resulting shortest partial path with ''Shortest_length'', updating if necessary and returns to the queue&lt;br /&gt;
to continue its search.  Process 0 allocates the shared queue and minimum length.  Exclusive access must be established&lt;br /&gt;
and maintained to ensure correctness and this is achieved through a lock on the queue and ''Shortest_length''.  Each&lt;br /&gt;
process acquires the queue lock to identify a promising partial path and releases it upon finding one.  When &lt;br /&gt;
updating ''Shortest_length'', a lock is acquired to ensure [http://en.wikipedia.org/wiki/Mutual_exclusion '''mutual exclusion'''] on this shared data as well.&lt;br /&gt;
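&lt;br /&gt;
The locking discipline around the shared minimum can be sketched as follows.  A Python ''threading.Lock'' stands in for a TreadMarks lock, and the candidate route lengths are made-up data; the sketch shows only the protected compare-and-update step, not the queue or the search itself.&lt;br /&gt;

```python
import threading

# Sketch only: the lock and the candidate lengths are illustrative assumptions.
shortest_length = float("inf")
length_lock = threading.Lock()


def report(candidates):
    """Fold a worker's completed route lengths into the shared minimum."""
    global shortest_length
    for length in candidates:
        # Hold the lock so the comparison and the update happen atomically.
        with length_lock:
            shortest_length = min(shortest_length, length)


work = [[42, 37, 58], [61, 29, 33], [30, 45, 90]]
threads = [threading.Thread(target=report, args=(c,)) for c in work]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shortest_length)
```

Without the lock, two workers could both read the old minimum and the later write could overwrite a shorter length found by the other.&lt;br /&gt;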
&lt;br /&gt;
=== DSM Implementations ===&lt;br /&gt;
From an architectural point of view, DSMs are composed of several nodes connected via a network. Each of the nodes can be an individual machine or a cluster of machines. Each system has local memory modules that are in part or entirely part of the shared memory. There are many characteristics that can be used to classify DSM implementations. One of them is obvious and is based on the nature of the implementation as demarcation: Software, Hardware, and Hybrid. This historical classification has been extracted from [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 Distributed shared memory: concepts and systems].&lt;br /&gt;
&lt;br /&gt;
Software DSM implementations refer to the DSM implemented by using user-level software, OS, programming language, or combination of all or some of them. &lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Implementation'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Implementation / Cluster configuration'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Network'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Algorithm'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Consistency Model'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Granularity Unit'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Coherence Policy'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''SW/HW/Hybrid'''&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/vm-and-gc/ivy-shared-virtual-memory-li-icpp-1988.pdf IVY]||User-level library + OS modification || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||1 Kbyte ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/121133.121159 Munin]||Runtime system + linker + library + preprocessor + OS modifications ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||Type-specific (SRSW, MRSW, MRMW) ||Release ||Variable size objects ||Type-specific (delayed update, invalidate) ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]||User-level || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRMW ||Lazy release ||4 Kbytes ||Update, Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/74851.74871 Mirage]||OS kernel ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||MRSW ||Sequential ||512 bytes ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://onlinelibrary.wiley.com.prox.lib.ncsu.edu/doi/10.1002/spe.4380210503/pdf Clouds]||OS, out of kernel || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Inconsistent, sequential ||8 Kbytes ||Discard segment when unlocked ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1663305&amp;amp;tag=1 Linda]||Language || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||Variable (tuple size) ||Implementation- dependent ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=93183 Memnet]||Single processor, Memnet device||Token ring||MRSW||Sequential||32 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Scalable_Coherent_Interface SCI]||Arbitrary||Arbitrary||MRSW||Sequential||16 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.8026&amp;amp;rep=rep1&amp;amp;type=pdf KSR1]||64-bit custom PE, I+D caches, 32M local memory||Ring-based||MRSW||Sequential||128 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=766965&amp;amp;isnumber=16621 RMS]||1-4 processors, caches, 256M local memory||RM bus||MRMW||Processor||4 bytes||Update||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Alewife_(multiprocessor) Alewife]||Sparcle PE, 64K cache, 4M local memory, CMMU|| mesh||MRSW||Sequential||16 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://dl.acm.org/citation.cfm?id=192056 Flash]||MIPS T5, I +D  caches, Magic controller|| mesh||MRSW||Release||128 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s08/papers/10_tempest.pdf Typhoon]||SuperSparc, 2-L caches|| NP controller||MRSW||Custom||32 bytes||Invalidate custom||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://shrimp.cs.princeton.edu/ Shrimp]||16 Pentium PC nodes|| Intel Paragon routing network||MRMW||AURC, scope||4 Kbytes||Update/Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Below is an explanation of the main characteristics listed in the DSM classification.&lt;br /&gt;
 &lt;br /&gt;
There are three types of DSM algorithm: &lt;br /&gt;
* '''Single Reader/ Single Writer''' (SRSW) &lt;br /&gt;
** central server algorithm - produces long network delays &lt;br /&gt;
** migration algorithm - produces thrashing and false sharing&lt;br /&gt;
* '''Multiple Readers/ Single Writer''' (MRSW) - read replication algorithm. It uses write invalidate. MRSW is the most widely adopted algorithm.&lt;br /&gt;
* '''Multiple Readers/Multiple Writers''' (MRMW) - full replication algorithm.  It has full concurrency and uses atomic updates.&lt;br /&gt;
&lt;br /&gt;
The consistency model plays a fundamental role in DSM systems. Due to the nature of distributed systems, memory accesses are constrained differently under the different consistency models. &amp;quot;A memory consistency model defines the legal ordering of memory references issued by some processor, as observed by other processors.&amp;quot; The stricter the consistency model, the higher the access times, but the simpler the programming. Some of the consistency model types are:&lt;br /&gt;
* Sequential consistency - all processors see the same ordering of memory references, and these are issued in sequences by the individual processors.&lt;br /&gt;
* Processor consistency - the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence. &lt;br /&gt;
* Weak consistency - consistency is required only on synchronization accesses.&lt;br /&gt;
* Release consistency - divides the synchronization accesses into acquire and release. Normal reads and writes for a particular node can be done only after all acquires for that node are finished. Similarly, releases can only be done after all writes and reads are finished. &lt;br /&gt;
* Lazy Release consistency - extends the Release consistency, by propagating modifications to the shared data only on the acquire, and of those, only the ones related to critical sections.&lt;br /&gt;
* Entry consistency - synchronization is performed at variable level. This increases programming labor but helps with lowering latency and traffic exchanges as only the specific variable needs to be synchronized.&lt;br /&gt;
&lt;br /&gt;
Granularity refers to the size of the data blocks that are managed by the coherence protocols. The unit differs between hardware and software systems, as hardware systems tend to use smaller blocks than the virtual memory layer that manages the data in software systems. The problem with larger blocks is that the probability of contention is higher, even when the processors involved are not accessing the exact same piece of memory, just different parts of the same block. This is known as false sharing and creates thrashing (memory blocks keep being requested by processors and processors keep waiting for the same memory blocks).&lt;br /&gt;
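&lt;br /&gt;
A small Python sketch can show how the coherence unit size determines whether two unrelated variables collide.  The helper names, page sizes and byte addresses are assumptions chosen for illustration.&lt;br /&gt;

```python
# Sketch only: the page sizes and byte addresses are illustrative assumptions.
def page_of(address, page_size):
    """Return the index of the coherence unit (page) containing the address."""
    return address // page_size


def falsely_shared(addr_a, addr_b, page_size):
    """Distinct addresses in the same page contend for that page."""
    same_page = page_of(addr_a, page_size) == page_of(addr_b, page_size)
    return addr_a != addr_b and same_page


# Two processors update different variables 2000 bytes apart.
x_addr, y_addr = 4096, 6096
print(falsely_shared(x_addr, y_addr, 4096))  # 4-Kbyte pages
print(falsely_shared(x_addr, y_addr, 1024))  # 1-Kbyte pages
```

With 4-Kbyte pages both variables fall in page 1 and are falsely shared, while with 1-Kbyte pages they fall in pages 4 and 5 and no longer interfere.&lt;br /&gt;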
&lt;br /&gt;
Coherence policy regulates data replication. The coherence policy dictates whether data that is being written at one site should be invalidated or updated at the remote sites. Usually, systems with fine-grain coherence (byte/word) impose the update policy, whereas systems based on coarse-grain (page) coherence utilize the invalidate policy. Elsewhere in the literature this is also called the coherence protocol, and the two types of protocol are known as write-invalidate and write-update. The write-invalidate protocol invalidates all copies except one before writing to it. In contrast, write-update keeps all copies up to date.&lt;br /&gt;
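&lt;br /&gt;
The difference between the two policies can be sketched in a few lines of Python.  Each dictionary below models one node's local copy of the shared data; the function names and the data are hypothetical.&lt;br /&gt;

```python
# Sketch only: each inner dictionary is one node's local copy of shared data.
def write_invalidate(copies, writer, key, value):
    """Remove every remote copy, then write the single remaining copy."""
    for node, cache in copies.items():
        if node != writer:
            cache.pop(key, None)
    copies[writer][key] = value


def write_update(copies, writer, key, value):
    """Write locally and push the new value to every node holding a copy."""
    copies[writer][key] = value
    for node, cache in copies.items():
        if node != writer and key in cache:
            cache[key] = value


inv = {0: {"x": 1}, 1: {"x": 1}, 2: {"x": 1}}
upd = {0: {"x": 1}, 1: {"x": 1}, 2: {"x": 1}}
write_invalidate(inv, 0, "x", 7)
write_update(upd, 0, "x", 7)
print(inv)
print(upd)
```

After the write, the invalidate variant leaves only node 0 with a copy, so later remote reads must fetch the value again, whereas the update variant pays the cost of propagating the value to all three copies immediately.&lt;br /&gt;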
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
There are numerous studies of the performance of shared memory applications in distributed systems. The vast majority of them use a collection of programs named [http://dl.acm.org/citation.cfm?id=223990 SPLASH and SPLASH-2.]&lt;br /&gt;
===== SPLASH and SPLASH-2 =====&lt;br /&gt;
The '''Stanford ParalleL Applications for SHared memory''' (SPLASH) is a collection of parallel programs engineered for the evaluation of shared address space machines. These programs have been used in research studies to measure and analyze different aspects of the DSM architectures that were emerging at the time. A subsequent suite of programs (SPLASH-2) evolved from the need to overcome the limitations of the SPLASH programs. SPLASH-2 covers a broader domain of scientific programs, makes use of improved algorithms, and pays more attention to the architecture of the underlying systems.&lt;br /&gt;
&lt;br /&gt;
Selected applications in the SPLASH-2 collections include:&lt;br /&gt;
*FFT: a '''Fast Fourier Transform''' implementation, in which the data is organized in source and destination matrices so that processors have stored in their local memory a contiguous set of rows. In this application all processors involved communicate among them, sending data to each other, to evaluate a matrix transposition.&lt;br /&gt;
*Ocean: calculations of large scale ocean movements simulating eddy currents. For the purpose of calculations, it shows nearest-neighbors accessing patterns in multi-grid formation as opposed to using a single grid.&lt;br /&gt;
*LU: decomposition of a matrix into the product of a lower triangular and an upper triangular matrix. LU exhibits a &amp;quot;one-to-many non-personalized communication&amp;quot;.&lt;br /&gt;
*Barnes: simulates the interaction of a group of particles over time steps. &lt;br /&gt;
*Radix sort: integer sorting algorithm. This algorithm implementation displays an example of communication among all the processors involved, and the nature of this communication presents irregular patterns.&lt;br /&gt;
&lt;br /&gt;
===== Case studies =====&lt;br /&gt;
In 2001, [http://escholarship.org/uc/item/76p9b40g#page-1 Shan et al.] presented a comparison of the performance and programming effort of MP versus SAS running on clusters of '''Symmetric Multiprocessors''' (SMPs). They highlighted the &amp;quot;automatic management and coherent replication&amp;quot; of the SAS programming model, which facilitates programming on these types of clusters. The study used the MPI/Pro protocol for the MP programming model and the GeNIMA SVM protocol (a page-based shared virtual memory protocol) for SAS on a 32-processor system (a cluster of 8 machines, each a 4-way SMP). The subset of applications used included regularly structured applications such as FFT, Ocean, and LU, in contrast with irregular ones such as RADIX sort, among others.&lt;br /&gt;
&lt;br /&gt;
The complexity of development is represented by the number of lines of code per application, as shown in the table below. SAS complexity is significantly lower than that of MP, and this difference increases as applications become more irregular and dynamic in nature (the MP version is almost twice as long for Radix).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Appl.&lt;br /&gt;
!FFT &lt;br /&gt;
!OCEAN &lt;br /&gt;
!LU &lt;br /&gt;
!RADIX &lt;br /&gt;
!SAMPLE &lt;br /&gt;
!N-BODY&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MPI ||222||4320||470||384||479||1371&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SAS ||210 ||2878 ||309||201||450||950&lt;br /&gt;
|}&lt;br /&gt;
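&lt;br /&gt;
The gap in the table can be expressed as a ratio of MPI lines to SAS lines. The following is a minimal sketch that only reuses the figures from the table; the names LINES_OF_CODE and mpi_to_sas_ratio are hypothetical helpers introduced here for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Lines of code per application, copied from the table above.
LINES_OF_CODE = {
    "FFT": {"MPI": 222, "SAS": 210},
    "OCEAN": {"MPI": 4320, "SAS": 2878},
    "LU": {"MPI": 470, "SAS": 309},
    "RADIX": {"MPI": 384, "SAS": 201},
    "SAMPLE": {"MPI": 479, "SAS": 450},
    "N-BODY": {"MPI": 1371, "SAS": 950},
}


def mpi_to_sas_ratio(app):
    """Return how many times longer the MPI version is than the SAS version."""
    counts = LINES_OF_CODE[app]
    return counts["MPI"] / counts["SAS"]


if __name__ == "__main__":
    # Print one ratio per application, in table order.
    for app in LINES_OF_CODE:
        print(f"{app}: MPI/SAS = {mpi_to_sas_ratio(app):.2f}")
```
&lt;br /&gt;
For Radix the ratio is about 1.91, which matches the observation that the MPI version is almost twice as long, while for FFT it is close to 1.&lt;br /&gt;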
&lt;br /&gt;
In terms of performance, the results indicated that SAS was only about half as efficient as MP at exploiting parallelism for most of the applications in this study. The only application that showed similar performance under both models was LU. The authors concluded that the overhead the SVM protocol incurs to maintain page coherence and synchronization was the handicap of the easier SAS programming model.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2004, [http://dl.acm.org/citation.cfm?id=1006252 Iosevich and Schuster] studied the choice of memory consistency model in a DSM. A memory consistency model provides a specific set of rules for interfacing with memory. The two models under study were '''sequential consistency''' (SC) and a relaxed consistency model, in particular the '''home-based lazy release consistency''' (HLRC) protocol. SC provides a less complex programming model, whereas HLRC improves run-time performance because it allows parallel memory access.&lt;br /&gt;
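&lt;br /&gt;
To make the rules of sequential consistency concrete, the sketch below enumerates every interleaving of a small two-thread test that preserves each thread's program order, and collects the possible results. The test program and all names (THREAD_A, THREAD_B, sc_outcomes) are illustrative assumptions and do not come from the study.&lt;br /&gt;
&lt;br /&gt;
```python
# Each thread writes its own flag, then reads the other thread's flag.
THREAD_A = [("write", "x", 1), ("read", "y", "r1")]
THREAD_B = [("write", "y", 1), ("read", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each thread's program order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one global order against a single shared memory."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for op, location, arg in schedule:
        if op == "write":
            memory[location] = arg
        else:
            registers[arg] = memory[location]
    return registers["r1"], registers["r2"]


def sc_outcomes():
    """Return the set of (r1, r2) results allowed under SC."""
    return {run(s) for s in interleavings(THREAD_A, THREAD_B)}


if __name__ == "__main__":
    print(sorted(sc_outcomes()))
```
&lt;br /&gt;
Under SC the result (0, 0) never appears, because some write always precedes both reads in any single global order. A relaxed model may allow that result unless the program adds synchronization.&lt;br /&gt;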
&lt;br /&gt;
The authors used a [http://www.cs.uga.edu/~dkl/6730/Fall02/Readings/millipage.pdf multiview] technique to ensure an efficient implementation of SC with fine-grain access to memory. The main advantage of this technique is that, by mapping one physical region to several virtual regions, the system avoids fetching the whole content of the physical region when a fault occurs on a specific variable located in one of the virtual regions. One step further is the mechanism proposed by [http://onlinelibrary.wiley.com/doi/10.1002/spe.417/abstract Niv and Schuster] to change the granularity dynamically at run time.&lt;br /&gt;
For SC (with multiview, MV), only the page replicas need to be tracked, whereas the tracking needed for HLRC is much more complex. The advantage of HLRC is that write faults on read-only pages are handled locally, so these operations cost less.&lt;br /&gt;
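&lt;br /&gt;
The building block behind multiview, one physical region visible through several virtual mappings with independent access rights, can be tried out with Python's standard mmap module. This is only a sketch of that mechanism under the assumption of a file-backed shared mapping; it is not the authors' implementation, and demo_two_views is a hypothetical name.&lt;br /&gt;
&lt;br /&gt;
```python
import mmap
import tempfile


def demo_two_views():
    """Map the same backing page twice, once writable and once read-only."""
    with tempfile.TemporaryFile() as backing:
        # Reserve one page of zeroed storage to act as the physical region.
        backing.write(bytes(mmap.PAGESIZE))
        backing.flush()
        fd = backing.fileno()
        with mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_WRITE) as rw_view, \
             mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_READ) as ro_view:
            # A write through one view is visible through the other view.
            rw_view[0:5] = b"hello"
            seen = bytes(ro_view[0:5])
            # The read-only view rejects writes, independently of the other.
            try:
                ro_view[0:5] = b"world"
                write_blocked = False
            except TypeError:
                write_blocked = True
    return seen, write_blocked


if __name__ == "__main__":
    print(demo_two_views())
```
&lt;br /&gt;
A DSM built on this idea can give each shared variable its own view and protect the views separately, so a fault identifies the variable that was touched rather than the whole page.&lt;br /&gt;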
&lt;br /&gt;
This table summarizes the characteristics of the benchmark applications used for the measurements. In the Synch column, B represents barriers and L represents locks.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Application&lt;br /&gt;
! Input data set&lt;br /&gt;
! Shared memory&lt;br /&gt;
!Sharing granularity&lt;br /&gt;
! Synch&lt;br /&gt;
!Allocation pattern&lt;br /&gt;
|-&lt;br /&gt;
| Water-nsq|| 8000 molecules|| 5.35MB|| a molecule (672B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Water-sp|| 8000 molecules|| 10.15MB|| a molecule (680B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| LU|| 3072 × 3072|| 72.10MB|| block (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| FFT|| 2^20 numbers|| 48.25MB|| a row segment|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| TSP|| A graph of 32 cities|| 27.86MB|| a tour (276B)|| L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| SOR|| 2066 × 10240|| 80.73MB|| a row (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Barnes-sp|| 32768 bodies|| 41.21MB|| body fields (4-32B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Radix|| 10240000 keys|| 82.73MB|| an integer (4B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Volrend|| a file -head.den-|| 29.34MB|| a 4 × 4 box (4B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Ocean|| a 514 × 514 grid|| 94.75MB|| grid point (8B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| NBody|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|-&lt;br /&gt;
| NBodyW|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The authors found that the average speedup of the HLRC protocol is 5.97, and the average speedup of the SC(with MV) protocol is 4.5 if non-optimized allocation is used, and 6.2 if the granularity is changed dynamically. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In 2008, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.1319 Roy and Chaudhary] compared the communication requirements of three different page-based DSM systems ([http://www.cs.umd.edu/projects/cvm/ CVM], [http://www.cs.utah.edu/flux/quarks.html Quarks], and [http://ieeexplore.ieee.org/iel4/5737/15339/00709960.pdf?arnumber=709960 Strings]) that use virtual memory mechanisms to trap accesses to shared areas. Their study was also based on the SPLASH-2 suite of programs.&lt;br /&gt;
In their experiments, Quarks was running tasks on separate processes (it does not support more than one application thread per process) while CVM and Strings were run with multiple application threads.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Program &lt;br /&gt;
!CVM &lt;br /&gt;
!Quarks &lt;br /&gt;
!Strings&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| FFT||1290||2419||1894&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-c||135||-||485&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-n||385||2873||407&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| OCEAN-c||1955||15475||6676&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-n2||2253||38438||10032&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-sp||905||7568||1998&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MATMULT||290||1307||645&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SOR||247||7236||934&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The results in the table above indicate that Quarks generates far more messages than the other two systems. LU-c (the contiguous version of LU) has no result for Quarks because an excessive number of collisions made it impossible to obtain results for the number of tasks (16) used in the experiment. In general, because of lock-related traffic, the performance of Quarks is quite low compared with the other two systems for many of the application programs tested.&lt;br /&gt;
CVM improves lock management by allowing out-of-order access through a centralized lock manager. This makes the locking times for CVM much smaller than for the others. &lt;br /&gt;
Nevertheless, the best performer is Strings. It outperforms the other two systems overall, but there is room for improvement here too, as locking was observed to account for a high percentage of the overall computation time in Strings. &lt;br /&gt;
&lt;br /&gt;
The overall conclusion is that using multiple threads per node and optimized locking mechanisms (CVM-type) provides the best performance.&lt;br /&gt;
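&lt;br /&gt;
The differences in communication volume are easier to see as ratios relative to CVM. The sketch below assumes, as the surrounding text suggests, that the table of CVM, Quarks, and Strings figures reports message counts; MESSAGES and ratio_to_cvm are hypothetical names, and None stands for the missing Quarks result for LU-c.&lt;br /&gt;
&lt;br /&gt;
```python
SYSTEMS = ("CVM", "Quarks", "Strings")

# Values copied from the table of results; None marks the missing entry.
MESSAGES = {
    "FFT": (1290, 2419, 1894),
    "LU-c": (135, None, 485),
    "LU-n": (385, 2873, 407),
    "OCEAN-c": (1955, 15475, 6676),
    "WATER-n2": (2253, 38438, 10032),
    "WATER-sp": (905, 7568, 1998),
    "MATMULT": (290, 1307, 645),
    "SOR": (247, 7236, 934),
}


def ratio_to_cvm(program, system):
    """Return the count for system divided by the CVM count, or None."""
    counts = dict(zip(SYSTEMS, MESSAGES[program]))
    if counts[system] is None:
        return None
    return counts[system] / counts["CVM"]


if __name__ == "__main__":
    for program in MESSAGES:
        quarks = ratio_to_cvm(program, "Quarks")
        strings = ratio_to_cvm(program, "Strings")
        quarks_text = "n/a" if quarks is None else f"{quarks:.1f}x"
        print(f"{program}: Quarks {quarks_text}, Strings {strings:.1f}x")
```
&lt;br /&gt;
For SOR, for example, the Quarks figure is roughly 29 times the CVM figure, while the Strings figure is under 4 times.&lt;br /&gt;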
&lt;br /&gt;
=== Evolution ===&lt;br /&gt;
A more recent distributed shared memory system is vNUMA. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html Chapman and Heiser] describe vNUMA (where v stands for virtual and NUMA for [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non-Uniform Memory Access]) as &amp;quot;a virtual machine that presents a cluster as a virtual shared-memory multiprocessor.&amp;quot; The virtualization in vNUMA is a layer between the real CPUs that form the distributed shared memory system and the OS that runs on top of it, usually Linux. The DSM in vNUMA is part of the hypervisor, which is the part of a virtual system that maps guest virtual addresses into real physical ones. The guest virtual addresses are then mapped into virtual memory addresses through the guest OS. &lt;br /&gt;
&lt;br /&gt;
Unlike other virtual machines, which often run several guest OSes on top of one host OS on a single machine, vNUMA runs one OS on top of several physical machines. It uses software DSM techniques to present all the virtualized memory as a single whole.&lt;br /&gt;
&lt;br /&gt;
The DSM in vNUMA is a single-writer/multiple-reader write-invalidate protocol with sequential consistency. It is based on the IVY DSM, but it introduces several improvements to increase performance. The owner of a page can determine whether the page needs to be sent by looking at the copyset (the set of nodes that hold a copy of the page), which avoids several page faults, and the manager becomes the owner of the copyset as soon as it is part of it. There are two further improvements: ''incremental deterministic merging'' and ''write-update-plus (WU+)''. ''Incremental deterministic merging'' uses sequence numbers to ensure that a location is updated with the latest value and not with intermediate, out-of-order writes. ''Write-update-plus (WU+)'' enforces single-writer access for pages on which atomic operations are performed. vNUMA dynamically changes from multiple-writer to single-writer when atomic operations are detected. &lt;br /&gt;
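&lt;br /&gt;
The role of the copyset in a single-writer/multiple-reader write-invalidate protocol can be sketched with a toy model of one shared page. This is a simplified illustration under assumed rules, not vNUMA's actual implementation; the classes Page and ToyDsm and their counters are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
class Page:
    """State of one shared page in a toy write-invalidate protocol."""

    def __init__(self, owner):
        self.owner = owner      # node holding the authoritative copy
        self.copyset = {owner}  # nodes holding a valid copy
        self.writer = None      # node with write access, if any
        self.value = 0


class ToyDsm:
    """Counts page transfers and invalidations for a single page."""

    def __init__(self, owner=0):
        self.page = Page(owner)
        self.transfers = 0
        self.invalidations = 0

    def read(self, node):
        page = self.page
        if node not in page.copyset:
            # Read fault: any writer is downgraded and the page is shipped.
            page.writer = None
            page.copyset.add(node)
            self.transfers += 1
        return page.value

    def write(self, node, value):
        page = self.page
        if page.writer != node:
            # Write fault: the copyset shows whether the page must be sent.
            if node not in page.copyset:
                self.transfers += 1
            # Invalidate every other copy before granting write access.
            self.invalidations += len(page.copyset - {node})
            page.copyset = {node}
            page.owner = node
            page.writer = node
        page.value = value


if __name__ == "__main__":
    dsm = ToyDsm(owner=0)
    dsm.read(1)
    dsm.write(1, 7)
    print(dsm.read(0), dsm.transfers, dsm.invalidations)
```
&lt;br /&gt;
In this model a node that already holds a read copy can be upgraded to writer without another page transfer, which is the saving the copyset check provides; only the invalidations of the other copies are needed.&lt;br /&gt;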
&lt;br /&gt;
In [http://communities.vmware.com/community/vmtn/cto/high-performance/blog/2011/09/19/vnuma-what-it-is-and-why-it-matters vNUMA what it is and why it matters], VMware presents vNUMA as part of vSphere, a virtualization platform aimed at building cloud computing frameworks.&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
*[http://www.sgi.com/pdfs/4250.pdf Performance and Productivity Breakthroughs with Very Large Coherent Shared Memory: The SGI® UV Architecture] SGI white paper.&lt;br /&gt;
*[http://www.scalemp.com/architecture#2 Versatile SMP (vSMP) Architecture]&lt;br /&gt;
*Adrian Moga, Michel Dubois, [http://www.sciencedirect.com/science/article/pii/S1383762108001136 &amp;quot;A comparative evaluation of hybrid distributed shared-memory systems,&amp;quot;] Journal of Systems Architecture, Volume 55, Issue 1, January 2009, Pages 43-52&lt;br /&gt;
*Jinbing Peng; Xiang Long; Limin Xiao; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=4708970&amp;amp;isnumber=4708921 &amp;quot;DVMM: A Distributed VMM for Supporting Single System Image on Clusters,&amp;quot;] Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for , vol., no., pp.183-188, 18-21 Nov. 2008&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
*Shan, H.; Singh, J.P.; Oliker, L.; Biswas, R.; , [http://escholarship.org/uc/item/76p9b40g#page-1 &amp;quot;Message passing vs. shared address space on a cluster of SMPs,&amp;quot;] Parallel and Distributed Processing Symposium., Proceedings 15th International , vol., no., pp.8 pp., Apr 2001&lt;br /&gt;
*Protic, J.; Tomasevic, M.; Milutinovic, V.; , [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 &amp;quot;Distributed shared memory: concepts and systems,&amp;quot;] Parallel &amp;amp; Distributed Technology: Systems &amp;amp; Applications, IEEE , vol.4, no.2, pp.63-71, Summer 1996&lt;br /&gt;
*Chandola, V. , [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.5206&amp;amp;rep=rep1&amp;amp;type=pdf &amp;quot;Design Issues in Implementation of Distributed Shared Memory in User Space,&amp;quot;]&lt;br /&gt;
*Nitzberg, B.; Lo, V. , [http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf &amp;quot;Distributed Shared Memory:  A Survey of Issues and Algorithms&amp;quot;]&lt;br /&gt;
*Steven Cameron Woo , Moriyoshi Ohara , Evan Torrie , Jaswinder Pal Singh , Anoop Gupta, [http://dl.acm.org/citation.cfm?id=223990 &amp;quot;The SPLASH-2 programs: characterization and methodological considerations,&amp;quot;] Proceedings of the 22nd annual international symposium on Computer architecture, p.24-36, June 22-24, 1995, S. Margherita Ligure, Italy&lt;br /&gt;
*Jegou, Y. , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Implementation of page management in Mome, a user-level DSM] Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on, p.479-486, 21 May 2003, IRISA/INRIA, France&lt;br /&gt;
*Hennessy, J.; Heinrich, M.; Gupta, A.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=747863 &amp;quot;Cache-Coherent Distributed Shared Memory:  Perspectives on Its Development and Future Challenges,&amp;quot;] Proceedings of the IEEE, Volume: 87 Issue:3, pp.418 - 429, Mar 1999, Comput. Syst. Lab., Stanford Univ., CA&lt;br /&gt;
*J. Protic, M. Tomasevic, and V. Milutinovic, [http://media.wiley.com/product_data/excerpt/76/08186773/0818677376-2.pdf An Overview of Distributed Shared Memory]&lt;br /&gt;
*Yoon, M.; Malek, M.; , [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf &amp;quot;Configurable Shared Virtual Memory for Parallel Computing&amp;quot;] University of Texas Technical Report tr94-21, July 15 1994, Department of Electrical and Computer Engineering, The University of Texas at Austin&lt;br /&gt;
*Dubnicki, C.;   Iftode, L.;   Felten, E.W.;   Kai Li;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=508084 &amp;quot;Software Support for Virtual Memory-Mapped Communication&amp;quot;] Parallel Processing Symposium, 1996., Proceedings of IPPS '96, The 10th International, pp.372 - 381, 15-19 Apr 1996, Dept. of Comput. Sci., Princeton Univ., NJ &lt;br /&gt;
*Dubnicki, C.;   Bilas, A.;   Li, K.;   Philbin, J.;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=580931 &amp;quot;Design and Implementation of Virtual Memory-Mapped Communication on Myrinet&amp;quot;] Parallel Processing Symposium, 1997. Proceedings., 11th International, pp.388 - 396, 1-5 Apr 1997, Princeton Univ., NJ&lt;br /&gt;
*Kranz, D.; Johnson, K.; Agarwal, A.; Kubiatowicz, J.; Lim, B.; , [http://delivery.acm.org.prox.lib.ncsu.edu/10.1145/160000/155338/p54-kranz.pdf?ip=152.1.24.251&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=81831968&amp;amp;CFTOKEN=62928147&amp;amp;__acm__=1327853249_029db7e958cb50bd47056f939f3296f7 &amp;quot;Integrating Message-Passing and Shared-Memory:  Early Experience&amp;quot;] PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp.54 - 63&lt;br /&gt;
*Amza, C.;  Cox, A.L.;  Dwarkadas, S.;  Keleher, P.;  Honghui Lu;  Rajamony, R.;  Weimin Yu;  Zwaenepoel, W.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 &amp;quot;TreadMarks: shared memory computing on networks of workstations&amp;quot;] Computer, Volume: 29 Issue:2, pp. 18 - 28, Feb 1996, Dept. of Comput. Sci., Rice Univ., Houston, TX&lt;br /&gt;
*David R. Cheriton. 1985. [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems.] SIGOPS Oper. Syst. Rev. 19, 4 (October 1985), 26-33.&lt;br /&gt;
*Matthew Chapman and Gernot Heiser. 2009. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html vNUMA: a virtual shared-memory multiprocessor.] In Proceedings of the 2009 conference on USENIX Annual technical conference (USENIX'09). USENIX Association, Berkeley, CA, USA, 2-2.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
The memory hierarchy described for the CSVS system places remote memories:&lt;br /&gt;
# Between main memory and local disk storage&lt;br /&gt;
# Same hierarchy as local disk storage&lt;br /&gt;
# Below local disk storage&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When messages are sent by the OS to retrieve non-local data the virtual address of the retrieved data is translated to physical:&lt;br /&gt;
# At the origin of the message, i.e. where the page fault occurs&lt;br /&gt;
# By the DSM system default manager&lt;br /&gt;
# At the location where the desired page resides&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
DSM nodes&lt;br /&gt;
# partially map variable amounts of their memory to the distributed address space&lt;br /&gt;
# are configured to supply a contiguous and fixed amount of memory to the distributed address space&lt;br /&gt;
# utilize I/O to access the entirely non-local distributed address space&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SAS programming model:&lt;br /&gt;
# Has evolved beyond MP as it is difficult to program in scalable DSM environments&lt;br /&gt;
# Utilizes MP to communicate but relies on the ease of a common address space&lt;br /&gt;
# Has suffered too many security problems, scalable MP now dominates the landscape&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Page management in MOME:&lt;br /&gt;
# Requires consistent address space mapping across all nodes&lt;br /&gt;
# Is managed from a global DSM perspective&lt;br /&gt;
# Allows an F and V page descriptor to occur for the same page on the same node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The most adopted DSM algorithm is:&lt;br /&gt;
# Single Reader/ Single Writer (SRSW)&lt;br /&gt;
# Multiple Readers/ Single Writer (MRSW)&lt;br /&gt;
# Multiple Readers/Multiple Writers (MRMW)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Sequential Consistency:&lt;br /&gt;
# all processors see the same ordering of memory references, and these are issued in sequences by the individual processors&lt;br /&gt;
# the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence&lt;br /&gt;
# consistency is required only on synchronization accesses&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
SPLASH is a&lt;br /&gt;
# coherence protocol&lt;br /&gt;
# collection of parallel programs engineered for the evaluation of shared address space machines&lt;br /&gt;
# DSM implementation&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The complexity of development is &lt;br /&gt;
# the same for MP and SAS&lt;br /&gt;
# lower for SAS&lt;br /&gt;
# lower for MP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
vNUMA is a&lt;br /&gt;
# Fast Fourier Transform implementation&lt;br /&gt;
# network implementation&lt;br /&gt;
# virtual machine that presents a cluster as a virtual shared-memory multiprocessor&lt;/div&gt;</summary>
		<author><name>Shvemuri</name></author>
	</entry>
</feed>