<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Mchen4</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Mchen4"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Mchen4"/>
	<updated>2026-05-10T20:49:07Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/12a_cm&amp;diff=74964</id>
		<title>CSC/ECE 506 Spring 2013/12a cm</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/12a_cm&amp;diff=74964"/>
		<updated>2013-04-18T04:23:09Z</updated>

		<summary type="html">&lt;p&gt;Mchen4: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, messages passed between processors are frequent and short&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;1body&amp;quot;&amp;gt;[[#1foot|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly, i.e., have '''low latency''', and must handle several messages at a time, i.e., have '''high bandwidth'''. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a router. The shape of the network (how the links and routers are arranged) is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;History of Network Topologies&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Hypercube topologies were invented in the 1980s and had desirable characteristics when the number of nodes was small (roughly 1,000 at most, often fewer than 100) and every processor had to stop working to receive and forward messages &amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports, using toroidal, mesh, or fat-tree topologies and wormhole routing. This era lasted about 20 years, until it was determined that routers with dozens of ports offered superior performance. Two topologies, the flattened butterfly and the dragonfly, were developed to take advantage of the newly available high-radix routers. Both fall somewhere between a mesh in which each point is a router (or a virtual router, in the case of the dragonfly) with dozens or hundreds of nodes attached, and a fat tree with arity high enough to have only two levels. &lt;br /&gt;
&lt;br /&gt;
The popularity of different topologies has changed over the years. The following pie charts from top500.org show the market share of different interconnect families in various years.&lt;br /&gt;
&lt;br /&gt;
[[Image:ob_wiki_12_2001.png ]] ''Interconnect Family Market Share for 2001''&lt;br /&gt;
[[Image:ob_wiki_12_2004.png]] ''Interconnect Family Market Share for 2004''&lt;br /&gt;
[[Image:ob_wiki_12_2007.png]] ''Interconnect Family Market Share for 2007''&lt;br /&gt;
[[Image:ob_wiki_12_2010.png]] ''Interconnect Family Market Share for 2010''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt; Performance Metrics for Interconnection Network Topologies&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The metrics used to evaluate the characteristics of a network topology, such as latency, bandwidth, and cost, are as follows:&lt;br /&gt;
*Diameter: This is the longest distance between any pair of nodes of the network. It is measured in terms of network hops (the number of links the message must travel before reaching the destination).&lt;br /&gt;
*Bisection Bandwidth: This is the minimum number of links that one must cut in order to partition the network into two halves. &lt;br /&gt;
*Degree: This is the total number of links in and out from a router. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Diameter and bisection bandwidth measure the performance of a network, while the degree, together with the total number of links, measures its cost.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
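&lt;br /&gt;
As a rough illustration of these definitions, the following Python sketch computes the diameter, bisection width, and degree of a topology described as an adjacency list. It is purely illustrative (the function names and the tiny ring example are assumptions, not taken from any reference), and the brute-force bisection search is only practical for very small networks.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: computing topology metrics from an adjacency list (illustrative only).&lt;br /&gt;
from collections import deque&lt;br /&gt;
from itertools import combinations&lt;br /&gt;
&lt;br /&gt;
def hops(adj, src):&lt;br /&gt;
    # BFS distances (in hops) from src to every other node.&lt;br /&gt;
    dist = {src: 0}&lt;br /&gt;
    q = deque([src])&lt;br /&gt;
    while q:&lt;br /&gt;
        u = q.popleft()&lt;br /&gt;
        for v in adj[u]:&lt;br /&gt;
            if v not in dist:&lt;br /&gt;
                dist[v] = dist[u] + 1&lt;br /&gt;
                q.append(v)&lt;br /&gt;
    return dist&lt;br /&gt;
&lt;br /&gt;
def diameter(adj):&lt;br /&gt;
    # Longest shortest-path distance over all node pairs, in network hops.&lt;br /&gt;
    return max(max(hops(adj, n).values()) for n in adj)&lt;br /&gt;
&lt;br /&gt;
def degree(adj):&lt;br /&gt;
    # Largest number of links in and out of any node.&lt;br /&gt;
    return max(len(neighbors) for neighbors in adj.values())&lt;br /&gt;
&lt;br /&gt;
def bisection_width(adj):&lt;br /&gt;
    # Brute force: fewest links crossing any equal split (tiny networks only).&lt;br /&gt;
    nodes = list(adj)&lt;br /&gt;
    best = None&lt;br /&gt;
    for part in combinations(nodes, len(nodes) // 2):&lt;br /&gt;
        cut = sum(1 for u in part for v in adj[u] if v not in part)&lt;br /&gt;
        best = cut if best is None else min(best, cut)&lt;br /&gt;
    return best&lt;br /&gt;
&lt;br /&gt;
# 4-node ring: diameter 2 (= p/2), bisection width 2, degree 2.&lt;br /&gt;
ring4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}&lt;br /&gt;
print(diameter(ring4), bisection_width(ring4), degree(ring4))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;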
&lt;br /&gt;
&amp;lt;h2&amp;gt;Interconnection evolution in the Top500 List&amp;lt;/h2&amp;gt;&lt;br /&gt;
[[Image:Top500interconnect.png|thumbnail|300px|left|]] &lt;br /&gt;
&lt;br /&gt;
This chart shows the evolution over time of the different interconnect topologies by their dominance in the top500 list of supercomputers &amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers list their interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is &amp;quot;other&amp;quot; and the dark red in the middle is &amp;quot;proprietary&amp;quot;, so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully distributed crossbar died out quickly; the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;8body&amp;quot;&amp;gt;[[#8foot|[8]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. Myrinet, Quadrics, and Federation shared the spotlight in the mid-2000s; each used a similar fat-tree topology&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;9body&amp;quot;&amp;gt;[[#9foot|[9]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;10body&amp;quot;&amp;gt;[[#10foot|[10]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. The current class of supercomputers is dominated by nodes connected with either InfiniBand or Gigabit Ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. InfiniBand is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit Ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes &amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;11body&amp;quot;&amp;gt;[[#11foot|[11]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;12body&amp;quot;&amp;gt;[[#12foot|[12]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Several metrics are normally chosen to represent the cost and performance of a given topology. In this section, the degree, number of links, diameter, and bisection width will be given for each topology.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Top_linear.jpg|thumbnail|frame|right|]]&lt;br /&gt;
&lt;br /&gt;
The nodes are connected linearly, as in an array. This type of topology is simple; however, it does not scale well. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* ''Diameter:''  p-1 &lt;br /&gt;
* ''Bisection BW:''  1&lt;br /&gt;
* ''# Links:''   p-1&lt;br /&gt;
* ''Degree:''    2&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A linear array is the cheapest way to connect a group of nodes together: its number of links and its degree are the smallest of any topology. However, the drawback of this topology is also obvious: the two end points are the farthest apart, which makes the diameter p-1. This topology is also not reliable, since the bisection bandwidth is only 1.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg|thumbnail|right|]]&lt;br /&gt;
The ring has the same structure as the linear array, except that the two end nodes are connected to each other, forming a circular structure. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* ''Diameter:''  p/2&lt;br /&gt;
* ''Bisection BW:''  2&lt;br /&gt;
* ''# Links:''   p&lt;br /&gt;
* ''Degree:''    2&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Compared with the linear array, the ring gains a relatively large improvement for very little effort (only one extra link): the longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Top_2Dmesh.jpg|thumbnail|right|]]&lt;br /&gt;
&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a two-dimensional structure. This topology is well suited to applications with grid-like communication patterns, such as the Ocean simulation and matrix computations.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* ''Diameter:''  2(sqrt(p)-1)   &lt;br /&gt;
* ''Bisection BW:''  sqrt(p)&lt;br /&gt;
* ''# Links:''   2sqrt(p)(sqrt(p)-1)&lt;br /&gt;
* ''Degree:''    4&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is the distance between two diagonally opposite corner nodes, which is the sum of two edges of length sqrt(p)-1.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Top_2Dtorus.jpg|thumbnail|right|]]&lt;br /&gt;
&lt;br /&gt;
Using the same trick that turned the linear array into the ring, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* ''Diameter:''  sqrt(p)-1&lt;br /&gt;
* ''Bisection BW:''  2sqrt(p)&lt;br /&gt;
* ''# Links:''   2p&lt;br /&gt;
* ''Degree:''    4&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the end-around connections, the longest distance is roughly halved and the bisection bandwidth is doubled. Of course, this comes at a cost: the number of links grows from 2sqrt(p)(sqrt(p)-1) to 2p, and the wrap-around links are physically long.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube (3-D Mesh)&amp;lt;/h2&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg|thumbnail|right|]]&lt;br /&gt;
&lt;br /&gt;
If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* ''Diameter:''  3(p&amp;lt;sup&amp;gt;1/3&amp;lt;/sup&amp;gt;-1)   --this is the corner-to-corner distance, analogous to the 2-d mesh formula&lt;br /&gt;
* ''Bisection BW:''  p&amp;lt;sup&amp;gt;2/3&amp;lt;/sup&amp;gt;  -- p&amp;lt;sup&amp;gt;1/3&amp;lt;/sup&amp;gt; rows of p&amp;lt;sup&amp;gt;1/3&amp;lt;/sup&amp;gt; links must be cut to bisect a cube&lt;br /&gt;
* ''# Links:''   3*p&amp;lt;sup&amp;gt;2/3&amp;lt;/sup&amp;gt;   -- there are p&amp;lt;sup&amp;gt;2/3&amp;lt;/sup&amp;gt; links in each of 3 dimensions.&lt;br /&gt;
* ''Degree:''    6 (from the inside nodes)&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]&lt;br /&gt;
&lt;br /&gt;
In an N-dimensional cube, the boundary nodes are normally the ones that hurt the performance of the entire network. We can fix this by connecting those boundary nodes together. The hypercube is essentially multiple cubes put together.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* ''Diameter:''  log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p)  &lt;br /&gt;
* ''Bisection BW:''  p/2              -- p/2 links run from one N-1 cube to the other.&lt;br /&gt;
* ''# Links:''   p/2 * log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p)  -- each node has a degree of log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p). Multiply by p nodes and divide by 2 nodes per link.&lt;br /&gt;
* ''Degree:''    log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p)&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
From the metrics we can see that the diameter and bisection bandwidth are significantly improved for this higher-order topology. Each node is numbered with a bitstring that is log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p) bits long, and the farthest-away node is that bitstring's complement. One bit can be flipped per hop, so the diameter is log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p).&lt;br /&gt;
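&lt;br /&gt;
As a small illustration of this property, the Python sketch below routes a packet through a hypercube by flipping one differing address bit per hop. The function name and node numbering are illustrative assumptions, not taken from any particular machine.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: dimension-order routing in a hypercube with p = 2**n nodes.&lt;br /&gt;
def hypercube_route(src, dst, n):&lt;br /&gt;
    # Each hop flips one address bit in which the current node and the&lt;br /&gt;
    # destination differ, so a route needs at most n = log2(p) hops,&lt;br /&gt;
    # which is the diameter of the network.&lt;br /&gt;
    path = [src]&lt;br /&gt;
    cur = src&lt;br /&gt;
    for bit in range(n):&lt;br /&gt;
        mask = 2 ** bit&lt;br /&gt;
        if (cur ^ dst) // mask % 2 == 1:&lt;br /&gt;
            cur = cur ^ mask       # flip this bit: move to that neighbor&lt;br /&gt;
            path.append(cur)&lt;br /&gt;
    return path&lt;br /&gt;
&lt;br /&gt;
# 3-dimensional hypercube (8 nodes): route from node 0 (000) to its complement 7 (111).&lt;br /&gt;
print(hypercube_route(0, 7, 3))    # [0, 1, 3, 7], three hops, equal to the diameter&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;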
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Top_tree.jpg|thumbnail|right|Tree]]&lt;br /&gt;
&lt;br /&gt;
The tree is a hierarchical structure with processing nodes at the bottom and switching nodes at the upper levels.  &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* ''Diameter:''  2log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p)  -- the path from a leaf through the root to the farthest leaf on the other side&lt;br /&gt;
* ''Bisection BW:''  1   -- breaking either link to the root bisects the tree&lt;br /&gt;
* ''# Links:''   2(p-1)  -- there are 2 downward links for each of the p-1 routers, when p is a power of 2.&lt;br /&gt;
* ''Degree:''    3  -- interior routers have a degree of 3.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. Another disadvantage of the tree topology is that its bisection bandwidth is only 1.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]&lt;br /&gt;
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by &amp;quot;fattening&amp;quot; up the links at the upper levels. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
* ''Diameter:''  2log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p)  -- same as a regular tree&lt;br /&gt;
* ''Bisection BW:''  p/2        -- all links to (one side of) the root must be cut to bisect the tree&lt;br /&gt;
* ''# Links:''   plog&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p) -- there are p links at each of log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p) levels&lt;br /&gt;
* ''Degree:''    p     -- the root node has p links through it. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The butterfly structure is similar to the tree structure, but it replicates the switching-node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at all levels.&lt;br /&gt;
* ''Diameter:''  2log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p)  -- same as a tree since butterfly has the same depth as a tree (just with p nodes at each level)&lt;br /&gt;
* ''Bisection BW:''  p            -- p links connect the two halves at the top level&lt;br /&gt;
* ''# Links:''   2p*log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p)  -- there are 2*p links at each level times log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p) levels.&lt;br /&gt;
* ''Degree:''    4    -- the routers in the middle levels all have 4 links. The leaves and routers at the top level each have 2 links.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The butterfly has similar performance to the hypercube. In terms of cost, the butterfly has a smaller degree (so cheaper routers can be used), but the hypercube has fewer links.&lt;br /&gt;
&lt;br /&gt;
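To summarize, the following Python sketch tabulates the cost/performance formulas listed above as functions of the node count p. It is a minimal restatement of the figures given in this section; it assumes p is a power of two, a perfect square, or a perfect cube as each formula requires, and it quotes the degree of interior nodes.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: the topology formulas above, as functions of the node count p.&lt;br /&gt;
from math import sqrt, log2&lt;br /&gt;
&lt;br /&gt;
def metrics(topology, p):&lt;br /&gt;
    # Returns (diameter, bisection bandwidth, number of links, degree).&lt;br /&gt;
    table = {&lt;br /&gt;
        'linear array': (p - 1, 1, p - 1, 2),&lt;br /&gt;
        'ring': (p / 2, 2, p, 2),&lt;br /&gt;
        '2-D mesh': (2 * (sqrt(p) - 1), sqrt(p), 2 * sqrt(p) * (sqrt(p) - 1), 4),&lt;br /&gt;
        '2-D torus': (sqrt(p) - 1, 2 * sqrt(p), 2 * p, 4),&lt;br /&gt;
        'cube (3-D mesh)': (3 * (p ** (1 / 3) - 1), p ** (2 / 3), 3 * p ** (2 / 3), 6),&lt;br /&gt;
        'hypercube': (log2(p), p / 2, p / 2 * log2(p), log2(p)),&lt;br /&gt;
        'tree': (2 * log2(p), 1, 2 * (p - 1), 3),&lt;br /&gt;
        'fat tree': (2 * log2(p), p / 2, p * log2(p), p),&lt;br /&gt;
        'butterfly': (2 * log2(p), p, 2 * p * log2(p), 4),&lt;br /&gt;
    }&lt;br /&gt;
    return table[topology]&lt;br /&gt;
&lt;br /&gt;
for name in ('2-D mesh', '2-D torus', 'hypercube', 'butterfly'):&lt;br /&gt;
    print(name, metrics(name, 64))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;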
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;, including the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages in terms of cost, performance, and reliability were discussed. &lt;br /&gt;
&lt;br /&gt;
[[File:Machines.png|frame|center|''Current Machine Statistics''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;13body&amp;quot;&amp;gt;[[#13foot|[13]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a good choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that with more input and output links, one needs routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Still, the routers would be expensive, and several of them would be required&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat-tree topology; it connects 1280 NEC processors&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Mercury Computer Systems' RACEway, their original interconnect fabric, uses 6-way crossbar chips organized in a fat-tree network. The fat-tree network was particularly well suited to Fast Fourier Transforms, which were used for signal processing&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;13body&amp;quot;&amp;gt;[[#13foot|[13]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor: there exists only a single path between any pair of nodes, so should a link break, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[Image:Disknet_torus.jpg|frame|center|''Torus structure''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
Some examples of current use of the torus structure include the QPACE SFB TR cluster in Germany, built with PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;7body&amp;quot;&amp;gt;[[#7foot|[7]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
Originally developed for military applications, wireless mesh networks are now being used in the consumer sector. The XO-1 laptop from the MIT Media Lab's One Laptop Per Child (&amp;quot;OLPC&amp;quot;) project uses mesh networking to create an inexpensive infrastructure: the connections made by the laptops themselves reduce the need for an external infrastructure&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;15body&amp;quot;&amp;gt;[[#15foot|[15]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.&lt;br /&gt;
 &lt;br /&gt;
Intel produced several supercomputers using the hypercube design, of which the best known was the iPSC/860. Other early supercomputers, including the first few models of the Connection Machine family from Thinking Machines Corporation, also used the hypercube design. The Cray T3E, Cray XT3, and SGI Origin 2000 used related k-ary n-cube topologies.&lt;br /&gt;
&lt;br /&gt;
[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
It is worth mentioning that even though many topologies have much better performance than the 2-D mesh, the cost of these advanced topologies is also high. Since most chips are laid out in two dimensions, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes increases the degree of each node. For the butterfly topology, although the degree stays constant, the required number of links and switches increases rapidly.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt; Why do meshes dominate?&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the perspective of performance and flexibility, it looks as if higher-dimensional networks are preferable to low-dimensional networks. In reality, however, the cost of building the network is also an important consideration. A mesh network is much easier to lay out because all of the connections can be made in two dimensions. Conversely, hypercubes and butterflies contain many crossing wires, which may need to be quite long in order to loop around the edge. &lt;br /&gt;
&amp;lt;br&amp;gt; &amp;lt;br&amp;gt;&lt;br /&gt;
In a 2-D network, each router is very simple, since it only needs a degree of 4. A router usually uses a crossbar switch to route inputs to outputs, and the complexity of the crossbar grows quadratically with the number of ports.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In comparison to a 2-D mesh, a router for a hypercube is much more complex: its degree grows with the number of dimensions, log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(p). For example, for 5 dimensions we need a router of degree 5, or 10 ports if the input and output channels of each link are counted separately. With other networks like the butterfly, the complexity of a router stays the same at degree 4, but we would need a larger number of routers since there are more links.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As an example, IBM's [http://en.wikipedia.org/wiki/Blue_Gene Blue Gene/L] uses a 3-D torus interconnect with auxiliary networks for global communications (broadcast and reductions), I/O, and management.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice the number of ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below, the average path length (the average number of hops) and the average link load (GB/s) are shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load are proportionally related. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers; for this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load roughly in half by connecting the edge nodes together; however, its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, they are high-bandwidth 10 GB/s ports, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and average link load are factored together, the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;2body&amp;quot;&amp;gt;[[#2foot|[2]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube; for systems that do not need as much bandwidth, they are also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance. &lt;br /&gt;
&lt;br /&gt;
However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.  &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Packet Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is the same given a source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not all packets are allowed to use the shortest path&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Deadlock arises from cyclic routing patterns; to avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To break such cycles, certain turns are disallowed. These are called '''turn restrictions''': some turns are forbidden so that a circular routing pattern can never form. Some common turn restrictions are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimension-ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are routed completely along the x dimension first and then along the y dimension, so turns from the y dimension back to the x dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
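&lt;br /&gt;
A minimal sketch of dimension-ordered X-Y routing on a 2-D mesh follows, assuming nodes are addressed by (x, y) coordinates. The helper names are illustrative and not taken from reference [3].&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: deterministic dimension-ordered (X-Y) routing on a 2-D mesh.&lt;br /&gt;
def step(cur, target):&lt;br /&gt;
    # One hop toward target along a single dimension (+1 or -1).&lt;br /&gt;
    return cur + (target - cur) // abs(target - cur)&lt;br /&gt;
&lt;br /&gt;
def xy_route(src, dst):&lt;br /&gt;
    # Route fully along the x dimension first, then along y, so a packet&lt;br /&gt;
    # never turns from the y dimension back into the x dimension.&lt;br /&gt;
    (x, y), (dx, dy) = src, dst&lt;br /&gt;
    path = [(x, y)]&lt;br /&gt;
    while x != dx:&lt;br /&gt;
        x = step(x, dx)&lt;br /&gt;
        path.append((x, y))&lt;br /&gt;
    while y != dy:&lt;br /&gt;
        y = step(y, dy)&lt;br /&gt;
        path.append((x, y))&lt;br /&gt;
    return path&lt;br /&gt;
&lt;br /&gt;
print(xy_route((0, 0), (2, 3)))&lt;br /&gt;
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;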
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models force some packets onto longer, non-minimal routes. This can cause unfairness and reduces the ability of the system to relieve congestion, so overall performance can suffer&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduces the odd-even turn model, a deadlock-free turn-restriction model that provides more even adaptiveness and better performance than the previously mentioned models&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
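&lt;br /&gt;
A minimal sketch of the four rules above as a turn-legality check. This is a simplified illustration: columns are numbered from 0, and treating column 0 as even is an assumption of this sketch.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: checking whether a single turn is allowed under the odd-even model.&lt;br /&gt;
# A turn is described by the direction of travel before the turn, the direction&lt;br /&gt;
# after the turn, and the column (x coordinate) of the node where the turn is made.&lt;br /&gt;
def turn_allowed(incoming, outgoing, column):&lt;br /&gt;
    even = (column % 2 == 0)&lt;br /&gt;
    if even and incoming == 'east' and outgoing == 'north':&lt;br /&gt;
        return False   # rule 1: no east-to-north turns in even columns&lt;br /&gt;
    if not even and incoming == 'north' and outgoing == 'west':&lt;br /&gt;
        return False   # rule 2: no north-to-west turns in odd columns&lt;br /&gt;
    if even and incoming == 'east' and outgoing == 'south':&lt;br /&gt;
        return False   # rule 3: no east-to-south turns in even columns&lt;br /&gt;
    if not even and incoming == 'south' and outgoing == 'west':&lt;br /&gt;
        return False   # rule 4: no south-to-west turns in odd columns&lt;br /&gt;
    return True&lt;br /&gt;
&lt;br /&gt;
print(turn_allowed('east', 'north', 4))   # False: even column&lt;br /&gt;
print(turn_allowed('east', 'north', 5))   # True:  odd column&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;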
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To evaluate the performance of the various turn-restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Simulations were run with uniform, transpose, and hot-spot traffic patterns. Uniform traffic has each node send messages to every other node with equal probability. Transpose traffic has each node send messages to the node at its mirrored (transposed) position in the mesh. Hot-spot traffic directs a portion of the messages to a few &amp;quot;hot spot&amp;quot; nodes that receive high traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
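&lt;br /&gt;
These patterns can be sketched as destination generators for a k x k mesh. The exact transpose and hot-spot definitions used in Chiu's simulations differ in detail, so the versions below are simplified assumptions for illustration only.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: simplified destination generators for the three traffic patterns&lt;br /&gt;
# on a k x k mesh, where a node is an (x, y) pair with 0-based coordinates.&lt;br /&gt;
import random&lt;br /&gt;
&lt;br /&gt;
def uniform_dst(src, k):&lt;br /&gt;
    # Every node is an equally likely destination.&lt;br /&gt;
    return (random.randrange(k), random.randrange(k))&lt;br /&gt;
&lt;br /&gt;
def transpose_dst(src, k):&lt;br /&gt;
    # Each node sends to the node at its mirrored (transposed) position.&lt;br /&gt;
    x, y = src&lt;br /&gt;
    return (y, x)&lt;br /&gt;
&lt;br /&gt;
def hotspot_dst(src, k, hotspots, percent):&lt;br /&gt;
    # With probability percent/100 the message goes to one of the hot-spot&lt;br /&gt;
    # nodes; otherwise it behaves like uniform traffic.&lt;br /&gt;
    if random.choices([True, False], weights=[percent, 100 - percent])[0]:&lt;br /&gt;
        return random.choice(hotspots)&lt;br /&gt;
    return uniform_dst(src, k)&lt;br /&gt;
&lt;br /&gt;
print(transpose_dst((2, 7), 15))                       # (7, 2)&lt;br /&gt;
print(hotspot_dst((0, 0), 15, [(7, 7)], percent=10))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;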
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, the x-y model has the &amp;quot;slowest&amp;quot; increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for hot-spot traffic. Only one hot spot was simulated for this test. The odd-even model outperforms the other models when hot-spot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;3body&amp;quot;&amp;gt;[[#3foot|[3]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hot spots is increased to five, the odd-even model begins to shine: its latency is the lowest for both 6 and 8 percent hot-spot traffic. Meanwhile, the performance of the x-y model is horrendous. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well under uniform traffic, it lacks adaptiveness. When the traffic contains hot spots, the x-y model suffers from its inability to adapt and re-route traffic around the congestion the hot spots cause. The odd-even model has superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports; data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input port to the selected output port, acting essentially as a multiplexer. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, which has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The availability of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;4body&amp;quot;&amp;gt;[[#4foot|[4]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current technology not only has a high radix but also lower latency than the previous generation; as the radix increases, the latency remains steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become possible; as router technology improves, even more complex, high-dimensionality topologies become feasible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;6body&amp;quot;&amp;gt;[[#6foot|[6]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;. With the increasing number of processors in multiprocessor systems and ever higher data rates, reliable transmission of data in the event of a network fault is of great concern; hence, fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized into two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that lasts for a very short duration. It can be caused, for example, by a change in the output of a flip-flop leading to the generation of an invalid header. These faults can be minimized using error-control coding, and they are generally evaluated in terms of bit error rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;&amp;lt;span id=&amp;quot;5body&amp;quot;&amp;gt;[[#5foot|[5]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network, for example because of damaged wires and associated circuitry. These faults are generally evaluated in terms of mean time between failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Permanent faults can be handled using one of two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are recalculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the operation of the processes in the network is not completely stalled; only the affected regions are repaired. Some of the methods to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The blocked region can be convex or non-convex, and it is ensured that none of the new routes introduces a cycle in the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes a lot of healthy nodes to be declared faulty, leading to a reduction in system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links that are adjacent to the faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
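&lt;br /&gt;
As a rough illustration of the fault-ring idea, the sketch below marks the healthy nodes surrounding a faulty region of a k x k 2-D mesh. The coordinate scheme and the choice to include diagonal neighbors are assumptions for illustration; the actual construction by Chalasani and Boppana also deals with links, boundary faults, and overlapping rings.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: marking the fault ring around faulty nodes in a k x k 2-D mesh.&lt;br /&gt;
def fault_ring(faulty, k):&lt;br /&gt;
    # The ring is the set of healthy nodes directly adjacent (including&lt;br /&gt;
    # diagonally) to at least one faulty node.&lt;br /&gt;
    ring = set()&lt;br /&gt;
    for (fx, fy) in faulty:&lt;br /&gt;
        for dx in (-1, 0, 1):&lt;br /&gt;
            for dy in (-1, 0, 1):&lt;br /&gt;
                n = (fx + dx, fy + dy)&lt;br /&gt;
                if n not in faulty and n[0] in range(k) and n[1] in range(k):&lt;br /&gt;
                    ring.add(n)&lt;br /&gt;
    return ring&lt;br /&gt;
&lt;br /&gt;
# A single faulty node at (2, 2): the ring is its eight surrounding neighbors.&lt;br /&gt;
print(sorted(fault_ring({(2, 2)}, 5)))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;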
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; Y. Solihin, ''Fundamentals of Parallel Computer Architecture''. Madison: OmniPress, 2009. &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;3foot&amp;quot;&amp;gt;[[#3body|3.]]&amp;lt;/span&amp;gt; http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;4foot&amp;quot;&amp;gt;[[#4body|4.]]&amp;lt;/span&amp;gt; http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies: Historical Trends and Comparisons &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;5foot&amp;quot;&amp;gt;[[#5body|5.]]&amp;lt;/span&amp;gt; http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga. &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;6foot&amp;quot;&amp;gt;[[#6body|6.]]&amp;lt;/span&amp;gt; http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;7foot&amp;quot;&amp;gt;[[#7body|7.]]&amp;lt;/span&amp;gt; http://www.top500.org TOP500 Supercomputing Sites &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;8foot&amp;quot;&amp;gt;[[#8body|8.]]&amp;lt;/span&amp;gt; http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;9foot&amp;quot;&amp;gt;[[#9body|9.]]&amp;lt;/span&amp;gt; http://www.myri.com/myrinet/overview/ Myrinet Overview  &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;10foot&amp;quot;&amp;gt;[[#10body|10.]]&amp;lt;/span&amp;gt; http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;11foot&amp;quot;&amp;gt;[[#11body|11.]]&amp;lt;/span&amp;gt; http://www.google.com/research/pubs/pub35155.html Dragonfly Topology &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;12foot&amp;quot;&amp;gt;[[#12body|12.]]&amp;lt;/span&amp;gt; http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&amp;amp;rep=rep1&amp;amp;type=pdf Flattened Butterfly Topology &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;13foot&amp;quot;&amp;gt;[[#13body|13.]]&amp;lt;/span&amp;gt; http://courses.engr.illinois.edu/cs533/sp2012/notes/InterconnectionNet.pdf  &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;14foot&amp;quot;&amp;gt;[[#14body|14.]]&amp;lt;/span&amp;gt; http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf  &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;15foot&amp;quot;&amp;gt;[[#15body|15.]]&amp;lt;/span&amp;gt; http://wiki.laptop.org/go/Mesh_Network_Details  &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mchen4</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/2a_lm&amp;diff=73041</id>
		<title>CSC/ECE 506 Spring 2013/2a lm</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/2a_lm&amp;diff=73041"/>
		<updated>2013-02-15T20:44:36Z</updated>

		<summary type="html">&lt;p&gt;Mchen4: Added new reference&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Dsm.jpg|300px|thumb|left|SCD's IBM SP system blackforest, a distributed shared memory ('''DSM''') system]]&lt;br /&gt;
&lt;br /&gt;
== SAS programming on distributed-memory machines ==&lt;br /&gt;
[https://docs.google.com/a/ncsu.edu/document/d/1898MW7jXRhuz40HXXiTsobSUDdUVBZ-aUjEyLdeQdNc/edit#, Topic Writeup]&lt;br /&gt;
&lt;br /&gt;
[http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/2a_bm Original Page]&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Shared_memory '''Shared Address Space'''] (SAS) programming on distributed memory machines is a programming abstraction that requires less development effort than the traditional method of [http://en.wikipedia.org/wiki/Message_passing '''Message Passing'''] (MP) on distributed memory machines, such as clusters of servers.  Distributed systems are groups of computers that communicate through a network and share a common work goal.  Distributed systems typically do not physically share the same memory (are not [http://en.wikipedia.org/wiki/Coupling_%28computer_programming%29 '''tightly coupled''']) but rather each processor or group of processors must depend on mechanisms other than direct memory access to reach the shared memory; this arrangement is called Distributed Shared Memory and is discussed below.  Relevant issues that come to bear include [http://en.wikipedia.org/wiki/Memory_coherence '''memory coherence'''], types of memory access, data and process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''], and performance.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
Distributed memory systems are multi-processor systems in which each processor has its own individual memory. Tasks can only operate on a processor's local memory, and if non-local data is required, the processor must communicate with one or more remote processors. Distributed memory systems started to flourish in the 1980s. The increasing performance of processors and network connectivity offered the perfect environment for parallel processing over a network of computers, which was a cheap way to put together massive computing power. The main drawback was the difficulty of going from sequential programs written for local memory to parallel programs for these machines. This is where SAS provided the means to simplify programming, by hiding the mechanisms used to access distant memory located in other computers of the cluster.&lt;br /&gt;
&lt;br /&gt;
In 1985, Cheriton, in his article [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 &amp;quot;Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems&amp;quot;], introduced ideas for the application of shared memory techniques in distributed memory systems. Cheriton envisioned a system of nodes with a pool of shared memory using a common file namespace that could &amp;quot;decentralize the implementation of a service.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Early distributed computer systems relied almost exclusively on message passing in order to communicate with one another, and this technique is still widely used today.  In a message passing (MP) model, each processor's local memory can be considered as isolated from that of the rest of the system.  Processes or objects can send or receive messages in order to communicate, and this can occur in a synchronous or asynchronous manner.  In distributed systems, and particularly with certain types of programs, the message passing model can become overly burdensome to the programmer as tracking data movement and maintaining data integrity can become challenging with many control threads.  A shared address or shared-memory system, however, can provide a programming model that simplifies data sharing via uniform mechanisms of data structure reads and writes on common memory.  Current distributed systems seek to take advantage of both SAS and MP programming model principles in hybrid systems.&lt;br /&gt;
&lt;br /&gt;
=== Distributed Shared Memory (DSM) ===&lt;br /&gt;
[[File:Dsmd.jpg|400px|thumb|right|Distributed Shared Memory]]&lt;br /&gt;
Most commonly, a distributed system utilizing SAS will consist of a set of nodes connected by a network.  Nodes may consist of individual processors or a multiprocessor system (e.g. a [http://en.wikipedia.org/wiki/Symmetric_multiprocessing '''Symmetric Multiprocessor'''] (SMP)), the latter typically sharing a system bus.  Each node contains a local memory, which maps partially to the distributed address space.  Relevant design elements of early SAS implementations included scalability, coherence, structure and granularity.  Most early examples did not structure memory; that is, the layout of shared memory was simply a linear array of words.  Some, however, structured data as objects or language types.  '''IVY''', an early example of a DSM system, implemented shared memory as virtual memory.  The granularity, or unit share size, for IVY was 1-Kbyte pages, and the memory was unstructured.  Choosing an optimal page size is a balancing act: a process often needs quick access to a large range of the shared address space, which argues for larger pages, but larger pages increase contention for individual pages among processes and can lead to [http://en.wikipedia.org/wiki/False_sharing '''false sharing'''].  [http://en.wikipedia.org/wiki/Memory_coherence Memory coherence] is another important design consideration, and the semantics chosen can range from strict to weak consistency.  The strictest consistency guarantees that a read returns the most recently written value.  Weaker consistencies may use synchronization operations to guarantee sequential consistency.&lt;br /&gt;
&lt;br /&gt;
==== Cache-Coherent DSM ====&lt;br /&gt;
&lt;br /&gt;
Early DSM systems implemented a shared address space where the amount of time required to access a piece of data was &lt;br /&gt;
related to its location.  These systems became known as [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access '''Non-Uniform Memory Access'''] (NUMA), whereas an SMP type&lt;br /&gt;
system is known as [http://en.wikipedia.org/wiki/Uniform_Memory_Access '''Uniform Memory Access'''] (UMA) architecture.  NUMA architectures were difficult to program due &lt;br /&gt;
to potentially significant differences in data access times. SMP architectures dealt with this problem through caching. &lt;br /&gt;
Protocols were established that ensured that, prior to writing a location, all other copies of the location (in other caches)&lt;br /&gt;
were invalidated.  These protocols do not scale to DSM machines, so different approaches are necessary.&lt;br /&gt;
&lt;br /&gt;
Cache-coherent DSM architectures rely on a directory-based [http://en.wikipedia.org/wiki/Cache_coherency '''cache coherence'''] protocol where an extra directory structure keeps track of all blocks that have been cached by each processor.  A coherence protocol can then establish a consistent view of &lt;br /&gt;
memory by maintaining state and other information about each cached block.  These states usually minimally include Invalid,&lt;br /&gt;
Shared, and Exclusive.  Furthermore, in a cache-coherent DSM machine the directory itself is distributed: each directory entry&lt;br /&gt;
resides in the node whose physical local memory holds the cache block it describes.&lt;br /&gt;
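&lt;br /&gt;
Below is a minimal sketch, in C, of what one such directory entry and a write-invalidate step might look like.  The structure layout, the fixed node count, and the ''send_invalidate()'' helper are illustrative assumptions rather than the design of any particular machine.&lt;br /&gt;
&lt;br /&gt;
 /* Illustrative directory entry and write-invalidate step; send_invalidate()&lt;br /&gt;
    is an assumed messaging primitive, not part of any real DSM interface. */&lt;br /&gt;
 #define MAX_NODES 64&lt;br /&gt;
 &lt;br /&gt;
 enum block_state { INVALID, SHARED, EXCLUSIVE };&lt;br /&gt;
 &lt;br /&gt;
 struct dir_entry {&lt;br /&gt;
     enum block_state state;      /* coherence state of this memory block   */&lt;br /&gt;
     char sharers[MAX_NODES];     /* nonzero if that node caches the block  */&lt;br /&gt;
     int  owner;                  /* node holding the exclusive copy, or -1 */&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 void send_invalidate(int node);  /* assumed network primitive */&lt;br /&gt;
 &lt;br /&gt;
 void handle_write_request(struct dir_entry *e, int requester)&lt;br /&gt;
 {&lt;br /&gt;
     int n;&lt;br /&gt;
     for (n = 0; n &amp;lt; MAX_NODES; n++)       /* invalidate every other copy */&lt;br /&gt;
         if (n != requester &amp;amp;&amp;amp; e-&amp;gt;sharers[n])&lt;br /&gt;
             send_invalidate(n);&lt;br /&gt;
     for (n = 0; n &amp;lt; MAX_NODES; n++)&lt;br /&gt;
         e-&amp;gt;sharers[n] = (n == requester); /* requester is now the sole holder */&lt;br /&gt;
     e-&amp;gt;owner = requester;&lt;br /&gt;
     e-&amp;gt;state = EXCLUSIVE;&lt;br /&gt;
 }&lt;br /&gt;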
&lt;br /&gt;
==== User-level DSM ====&lt;br /&gt;
[[File:Untitled_Project.jpg|350px|thumb|left|Memory Mapping in Mome]]&lt;br /&gt;
&lt;br /&gt;
Another form of SAS is a User-level DSM system. In this arrangement, shared memory does not exist until defined by the programmer. Through explicit commands, segments of a processor's private memory become mapped and available as shared memory. &lt;br /&gt;
&lt;br /&gt;
An in-depth example of a user-level DSM system is [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Mome].  Mome, presented in 2003, is a run-time system that maps shared Mome segments onto each node's private address space.&lt;br /&gt;
&lt;br /&gt;
===== Mome Segment creation =====&lt;br /&gt;
&lt;br /&gt;
Segment creation was initiated through a ''MomeCreateSegment(size)'' call, which returned an identifier used by all nodes for mapping.  Any process can request a mapping of a section of its local memory to a Mome segment section by calling ''MomeMap(Addr, Lg, Prot, Flags, Seg, Offset)'', which returns the starting address of the mapped region.  Each mapping request made by a process is independent, and the addresses of the mappings may or may not be consistent on all nodes.  If mappings are consistent between processes, however, then pointers may be shared by them.  Mome supports strong and weak consistency models, and for any particular page each node is able to dynamically manage its consistency during program execution.&lt;br /&gt;
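&lt;br /&gt;
A hedged sketch of how these two calls might be used follows; the prototypes, flag values, and segment size are assumptions made for illustration, since the text above gives only the call names and their argument lists.&lt;br /&gt;
&lt;br /&gt;
 /* Sketch of Mome segment creation and mapping; the prototypes and the&lt;br /&gt;
    protection value are assumed, only the call names come from the text. */&lt;br /&gt;
 #define SEG_SIZE (1024 * 1024)          /* one shared segment of 1 MB */&lt;br /&gt;
 &lt;br /&gt;
 int   MomeCreateSegment(long size);                       /* assumed prototype */&lt;br /&gt;
 void *MomeMap(void *addr, long lg, int prot, int flags,&lt;br /&gt;
               int seg, long offset);                      /* assumed prototype */&lt;br /&gt;
 &lt;br /&gt;
 void example(void)&lt;br /&gt;
 {&lt;br /&gt;
     /* one node creates the segment and publishes the returned identifier */&lt;br /&gt;
     int seg = MomeCreateSegment(SEG_SIZE);&lt;br /&gt;
 &lt;br /&gt;
     /* every node then maps part of its private address space onto it */&lt;br /&gt;
     double *shared = MomeMap(0, SEG_SIZE, 3 /* read+write, assumed */,&lt;br /&gt;
                              0, seg, 0);&lt;br /&gt;
 &lt;br /&gt;
     shared[0] = 42.0;   /* ordinary loads and stores now reach shared data */&lt;br /&gt;
 }&lt;br /&gt;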
&lt;br /&gt;
===== Page Management in Mome =====&lt;br /&gt;
&lt;br /&gt;
Mome manages [http://en.wikipedia.org/wiki/Page_%28computer_memory%29 '''pages'''] in a directory based scheme where each page directory maintains the status of six characteristics per page on each node.  The page manager acts upon collections of nodes according to these characteristics for each page:  &lt;br /&gt;
V nodes possess the current version, M nodes have a modified version, S nodes want strong consistency, I nodes are&lt;br /&gt;
invalidated, F nodes have initiated a modification merge, and H nodes hold the page in a special hidden state.  A new version of &lt;br /&gt;
a page is created prior to a constraint violation and before modifications are integrated as a result of a consistency&lt;br /&gt;
request.  &lt;br /&gt;
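&lt;br /&gt;
For illustration only, the per-page bookkeeping described above might be pictured as six per-node flags; the concrete layout below is an assumption, as the text does not give Mome's actual data structures.&lt;br /&gt;
&lt;br /&gt;
 /* Assumed illustration of the six per-node characteristics the Mome page&lt;br /&gt;
    directory tracks for each page (V, M, S, I, F, H). */&lt;br /&gt;
 #define MAX_NODES 32&lt;br /&gt;
 &lt;br /&gt;
 struct mome_page_entry {&lt;br /&gt;
     char V[MAX_NODES];   /* node possesses the current version          */&lt;br /&gt;
     char M[MAX_NODES];   /* node holds a modified version               */&lt;br /&gt;
     char S[MAX_NODES];   /* node wants strong consistency for this page */&lt;br /&gt;
     char I[MAX_NODES];   /* node's copy has been invalidated            */&lt;br /&gt;
     char F[MAX_NODES];   /* node has initiated a modification merge     */&lt;br /&gt;
     char H[MAX_NODES];   /* node holds the page in the hidden state     */&lt;br /&gt;
 };&lt;br /&gt;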
&lt;br /&gt;
===== Memory mapping in Mome =====&lt;br /&gt;
&lt;br /&gt;
The Mome memory mapping figure to the left shows a possible DSM memory organization on a single node.  The DSM memory size&lt;br /&gt;
shown is 22 pages.  When a new segment is created on a node a segment descriptor is created on that node.  In this case the&lt;br /&gt;
segment descriptor is 12 pages, with each segment descriptor block corresponding to one page.  Each block also contains&lt;br /&gt;
three DSM memory references for current, modified and next version of pages.  The memory organization state shows an &lt;br /&gt;
application with two mappings, M1 and M2, with segment offsets at 0 and 8.  The six pages of M1 are managed by segment &lt;br /&gt;
descriptor blocks 0 to 5.  The descriptor blocks (and application memory) show that pages 1,2 and 5 have no associated &lt;br /&gt;
memory, while M1 page 0 is mapped to block 6 as a current version and M1 page 3 is mapped to block 13 as a current version,&lt;br /&gt;
block 8 as a modified version, and has initiated a modifications merge as indicated by the block 17 pointer.  The communication&lt;br /&gt;
layer manages incoming messages from other nodes. &lt;br /&gt;
&lt;br /&gt;
[[File:Mem hierarchy.png|200px|thumb|right|Memory hierarchy of node]]&lt;br /&gt;
&lt;br /&gt;
==== Configurable Shared Virtual Space ====&lt;br /&gt;
As described by [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf Yoon et al.] in 1994, the communication paradigm &lt;br /&gt;
for their DSM node network relies on a memory hierarchy for each node that places remote memories at the same hierarchy as &lt;br /&gt;
its own local disk storage.  [http://en.wikipedia.org/wiki/Page_fault '''Page faults'''] within a given node that can be resolved within disk storage are handled &lt;br /&gt;
normally, while those that cannot are resolved between node main memory and the memory of other nodes.  Point-to-point &lt;br /&gt;
communication at the node level is supported through message passing, and the specific mechanism for communication is&lt;br /&gt;
agreed to by all nodes. &lt;br /&gt;
&lt;br /&gt;
Yoon describes a DSM system that generates a shared virtual memory on a per-job basis.  A '''configurable shared virtual address space'''&lt;br /&gt;
(CSVS) is prepared when a member node receives a job, generates a job identification number and creates&lt;br /&gt;
an information table in its memory:&lt;br /&gt;
&lt;br /&gt;
                                     ''JOB_INFORMATION {''&lt;br /&gt;
                                         ''status;''&lt;br /&gt;
                                         ''number_of_tasks;''&lt;br /&gt;
                                         ''number_of_completed_tasks;''&lt;br /&gt;
                                         ''*member_list;''                    /*pointer to first member*/&lt;br /&gt;
                                         ''number_of_members;''&lt;br /&gt;
                                         ''IO_server;''&lt;br /&gt;
                                     ''}''&lt;br /&gt;
&lt;br /&gt;
The ''status'' refers to the creation of the CSVS and ''number_of_members'' and ''member_list'' are established through&lt;br /&gt;
a task distribution process during address space assignment.  All tasks associated with the program are tagged with&lt;br /&gt;
the ''job_id'' and ''requester_id'' and, following address space assignment, are distributed across the system.  The&lt;br /&gt;
actual CSVS creation occurs when the first task of a job is initiated by a member, which requests generation of the new&lt;br /&gt;
CSVS from all other members.  Subspace assignment for the SAS model then proceeds under the specific ''job_id''.&lt;br /&gt;
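&lt;br /&gt;
For illustration, the ''JOB_INFORMATION'' table above might correspond to a C structure along the following lines; the field types and the representation of the member list are assumptions, since the listing gives only field names.&lt;br /&gt;
&lt;br /&gt;
 /* Assumed C rendering of the JOB_INFORMATION table shown above. */&lt;br /&gt;
 struct member;                      /* a cluster node participating in the job */&lt;br /&gt;
 &lt;br /&gt;
 struct JOB_INFORMATION {&lt;br /&gt;
     int status;                     /* state of CSVS creation for this job      */&lt;br /&gt;
     int number_of_tasks;            /* tasks the job was divided into           */&lt;br /&gt;
     int number_of_completed_tasks;&lt;br /&gt;
     struct member *member_list;     /* pointer to first member                  */&lt;br /&gt;
     int number_of_members;          /* filled in during task distribution       */&lt;br /&gt;
     int IO_server;                  /* node acting as I/O server (assumed type) */&lt;br /&gt;
 };&lt;br /&gt;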
&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Operating_system '''operating system'''] (OS) or [http://en.wikipedia.org/wiki/Memory_management_unit '''memory management unit'''] (MMU) of each member maintains a copy of the ''JOB_INFORMATION'' &lt;br /&gt;
table, which is consulted to identify the default manager when a page fault occurs.  On a fault, the MMU&lt;br /&gt;
locates the default manager and handles the fault normally.  If the requested page is outside the node's subspace, then the &lt;br /&gt;
virtual address, ''job_id'', and default manager identification are sent to the [http://en.wikipedia.org/wiki/Control_unit '''control unit'''] (CU) to construct a &lt;br /&gt;
message requesting a page copy.  All messages sent through the CSVS must include a virtual address and the ''job_id'',&lt;br /&gt;
which acts as protection to control access to relevant memory locations.  When received at the appropriate member&lt;br /&gt;
node, the virtual address is translated to a local physical address.&lt;br /&gt;
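&lt;br /&gt;
The fault-handling decision just described could be sketched as follows; the helper functions and the message layout are placeholders, not the interfaces of the actual system.&lt;br /&gt;
&lt;br /&gt;
 /* Illustrative CSVS page-fault path: faults inside the local subspace are&lt;br /&gt;
    resolved normally, the rest become a page-copy request carrying the&lt;br /&gt;
    virtual address and job_id.  All names below are assumed. */&lt;br /&gt;
 struct page_request {&lt;br /&gt;
     unsigned long vaddr;     /* faulting virtual address                */&lt;br /&gt;
     int job_id;              /* protects access to the job's CSVS       */&lt;br /&gt;
     int default_manager;     /* manager found via JOB_INFORMATION       */&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 int  in_local_subspace(unsigned long vaddr);           /* assumed helpers */&lt;br /&gt;
 void resolve_locally(unsigned long vaddr);&lt;br /&gt;
 int  lookup_default_manager(int job_id, unsigned long vaddr);&lt;br /&gt;
 void send_to_control_unit(struct page_request req);&lt;br /&gt;
 &lt;br /&gt;
 void handle_page_fault(unsigned long vaddr, int job_id)&lt;br /&gt;
 {&lt;br /&gt;
     if (in_local_subspace(vaddr)) {&lt;br /&gt;
         resolve_locally(vaddr);              /* normal fault handling */&lt;br /&gt;
     } else {&lt;br /&gt;
         struct page_request req;&lt;br /&gt;
         req.vaddr = vaddr;&lt;br /&gt;
         req.job_id = job_id;&lt;br /&gt;
         req.default_manager = lookup_default_manager(job_id, vaddr);&lt;br /&gt;
         send_to_control_unit(req);           /* CU builds the page-copy message */&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;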
[[File:Jacobi_code.jpg|300px|thumb|left|Jacobi method pseudocode using TreadMarks API]]&lt;br /&gt;
&lt;br /&gt;
=====Improvements in communication=====&lt;br /&gt;
Early SAS programming models in DSM environments suffered from poor performance because protection schemes required&lt;br /&gt;
applications to access the network via system calls, significantly increasing latency.  Later software&lt;br /&gt;
systems and network interfaces arose that were able to ensure safety without incurring the cost of those system calls.  Addressing this and other &lt;br /&gt;
latency sources on both ends of communication was an important goal for projects such as the '''Virtual Memory-Mapped Communication''' (VMMC) model developed as part of the [http://shrimp.cs.princeton.edu/index.html Shrimp Project]. &lt;br /&gt;
&lt;br /&gt;
Protection is achieved in VMMC because the receiver must grant permission before the sender is allowed to transfer data&lt;br /&gt;
to a receiver-defined area of its address space.  In this communication scheme, the receiver process exports areas of its&lt;br /&gt;
address space that will act as receive buffers and sending processes must import the destinations.  There is no explicit &lt;br /&gt;
receive operation in VMMC.  Receivers are able&lt;br /&gt;
to define which senders can import specific buffers and VMMC ensures only receiver buffer space is overwritten.  Imported&lt;br /&gt;
receive buffers are mapped to a destination proxy space which can be implemented as part of the sender's virtual address&lt;br /&gt;
space and can be translated by VMMC to a receiver, process and memory address.  VMMC supports a deliberate update&lt;br /&gt;
request and will update data sent previously to an imported receive buffer.  This transfer occurs directly without receiver&lt;br /&gt;
CPU interruption.&lt;br /&gt;
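&lt;br /&gt;
The export/import flow might look roughly like the sketch below; the function names and signatures are placeholders, since the text describes VMMC's behavior but not its exact interface.&lt;br /&gt;
&lt;br /&gt;
 /* Placeholder sketch of a VMMC-style export/import flow; none of these&lt;br /&gt;
    function names are the real interface. */&lt;br /&gt;
 int  vmmc_export(void *buffer, long length, int allowed_sender); /* receiver side */&lt;br /&gt;
 long vmmc_import(int receiver, int buffer_id);                   /* sender side   */&lt;br /&gt;
 int  vmmc_send(long dest_proxy, long offset, void *src, long length);&lt;br /&gt;
 &lt;br /&gt;
 void sender(int receiver, int buffer_id, double *data, long length)&lt;br /&gt;
 {&lt;br /&gt;
     /* import the receiver's exported buffer into proxy space ... */&lt;br /&gt;
     long proxy = vmmc_import(receiver, buffer_id);&lt;br /&gt;
 &lt;br /&gt;
     /* ... then transfer directly into it: the data lands in the receiver's&lt;br /&gt;
        address space without any explicit receive operation and without&lt;br /&gt;
        interrupting the receiver's CPU */&lt;br /&gt;
     vmmc_send(proxy, 0, data, length);&lt;br /&gt;
 }&lt;br /&gt;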
&lt;br /&gt;
[[File:Shortest_path_pseudocode.jpg|300px|thumb|right|Shortest path pseudocode using TreadMarks API]]&lt;br /&gt;
&lt;br /&gt;
=== Programming Environment ===&lt;br /&gt;
The globally shared memory abstraction provided through virtual memory or some other DSM mechanism allows programmers &lt;br /&gt;
to focus on algorithms instead of processor communication and data tracking.  Many programming environments have been&lt;br /&gt;
developed for DSM systems including Rice University's [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]&lt;br /&gt;
in the 1990s.  TreadMarks was a user-level library that ran on top of Unix.  Programs were written in&lt;br /&gt;
C, C++ or Fortran and then compiled and linked with the TreadMarks library.  &lt;br /&gt;
&lt;br /&gt;
Shown at left is a pseudocode example of using the TreadMarks API to implement the Jacobi method, a type of partial &lt;br /&gt;
differential equation solver.  The code iterates over a 2D array and updates each element to the average of its four&lt;br /&gt;
nearest neighbors.  All processors are assigned an approximately equivalent number of rows and neighboring processes &lt;br /&gt;
share boundary rows as is necessary for the calculation.  This example shows TreadMarks use of [http://en.wikipedia.org/wiki/Barrier_%28computer_science%29 '''barriers'''], a technique used for process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''].  Barriers prevent race&lt;br /&gt;
conditions.  ''void Tmk_startup(int argc, char **argv)'' initializes TreadMarks and starts the remote processes.  &lt;br /&gt;
The ''void Tmk_barrier(unsigned id)'' call blocks the calling process until every other process arrives at the barrier.  In this&lt;br /&gt;
example, ''Tmk_barrier(0)'' guarantees that process 0 completes initialization before any process proceeds, ''Tmk_barrier(1)'' &lt;br /&gt;
guarantees all previous iteration values are read before any current iteration values are written, and ''Tmk_barrier(2)''&lt;br /&gt;
guarantees all current iteration values are written before any next iteration computation begins.&lt;br /&gt;
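&lt;br /&gt;
Since the figure is only an image, a rough reconstruction of the Jacobi pseudocode in C follows.  It assumes the TreadMarks calls ''Tmk_malloc'' and ''Tmk_distribute'' and the library globals ''Tmk_proc_id'' and ''Tmk_nprocs'' in addition to the calls described above; the header name, array sizes, and iteration count are made up, and the real example differs in detail.&lt;br /&gt;
&lt;br /&gt;
 /* Hedged reconstruction of the Jacobi example described above; not the original code. */&lt;br /&gt;
 #include &amp;quot;Tmk.h&amp;quot;                     /* TreadMarks header; name assumed */&lt;br /&gt;
 &lt;br /&gt;
 #define M 1024&lt;br /&gt;
 #define N 1024&lt;br /&gt;
 &lt;br /&gt;
 float (*grid)[N];                     /* shared array, allocated by process 0 */&lt;br /&gt;
 static float scratch[M][N];           /* private per-process scratch space    */&lt;br /&gt;
 &lt;br /&gt;
 int main(int argc, char **argv)&lt;br /&gt;
 {&lt;br /&gt;
     int i, j, iter, begin, end;&lt;br /&gt;
 &lt;br /&gt;
     Tmk_startup(argc, argv);&lt;br /&gt;
     if (Tmk_proc_id == 0) {&lt;br /&gt;
         grid = (float (*)[N]) Tmk_malloc(M * N * sizeof(float));&lt;br /&gt;
         Tmk_distribute((char *) &amp;amp;grid, sizeof(grid));&lt;br /&gt;
         /* ... initialize grid here ... */&lt;br /&gt;
     }&lt;br /&gt;
     Tmk_barrier(0);                           /* wait for initialization */&lt;br /&gt;
 &lt;br /&gt;
     begin = (M / Tmk_nprocs) * Tmk_proc_id;   /* this process's block of rows */&lt;br /&gt;
     end   = (M / Tmk_nprocs) * (Tmk_proc_id + 1);&lt;br /&gt;
 &lt;br /&gt;
     for (iter = 0; iter &amp;lt; 100; iter++) {&lt;br /&gt;
         for (i = begin; i &amp;lt; end; i++) {&lt;br /&gt;
             if (i == 0 || i == M - 1) continue;    /* boundary rows stay fixed */&lt;br /&gt;
             for (j = 1; j &amp;lt; N - 1; j++)&lt;br /&gt;
                 scratch[i][j] = 0.25f * (grid[i-1][j] + grid[i+1][j] +&lt;br /&gt;
                                          grid[i][j-1] + grid[i][j+1]);&lt;br /&gt;
         }&lt;br /&gt;
         Tmk_barrier(1);      /* all old values read before any new one is written */&lt;br /&gt;
         for (i = begin; i &amp;lt; end; i++) {&lt;br /&gt;
             if (i == 0 || i == M - 1) continue;&lt;br /&gt;
             for (j = 1; j &amp;lt; N - 1; j++)&lt;br /&gt;
                 grid[i][j] = scratch[i][j];&lt;br /&gt;
         }&lt;br /&gt;
         Tmk_barrier(2);      /* all new values written before the next iteration */&lt;br /&gt;
     }&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;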
&lt;br /&gt;
To the right is shown a short pseudocode program exemplifying another SAS synchronization technique which uses [http://en.wikipedia.org/wiki/Lock_%28computer_science%29 '''locks'''].  This program calculates the shortest path in a grouping of nodes that starts at any designated start node, visits each&lt;br /&gt;
other node once and returns to the origin node.  The shortest route identified thus far is stored in the shared ''Shortest_length''&lt;br /&gt;
and investigated routes are kept in a queue, most promising at the front, and expanded one node at a time.  A process&lt;br /&gt;
compares its resulting shortest partial path with ''Shortest_length'', updating if necessary and returns to the queue&lt;br /&gt;
to continue its search.  Process 0 allocates the shared queue and minimum length.  Exclusive access must be established&lt;br /&gt;
and maintained to ensure correctness and this is achieved through a lock on the queue and ''Shortest_length''.  Each&lt;br /&gt;
process acquires the queue lock to identify a promising partial path and releases it upon finding one.  When &lt;br /&gt;
updating ''Shortest_length'', a lock is acquired to ensure [http://en.wikipedia.org/wiki/Mutual_exclusion '''mutual exclusion'''] on this shared data as well.&lt;br /&gt;
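&lt;br /&gt;
A minimal sketch of the locking pattern just described is shown below, assuming the TreadMarks calls ''Tmk_lock_acquire'' and ''Tmk_lock_release''; the queue handling itself is only outlined in comments.&lt;br /&gt;
&lt;br /&gt;
 /* Sketch of lock-based synchronization for the shortest-path search;&lt;br /&gt;
    assumes the TreadMarks header has been included. */&lt;br /&gt;
 extern int *Shortest_length;      /* shared, allocated by process 0 */&lt;br /&gt;
 &lt;br /&gt;
 void expand_one_path(void)&lt;br /&gt;
 {&lt;br /&gt;
     Tmk_lock_acquire(0);          /* exclusive access to the shared queue */&lt;br /&gt;
     /* ... remove the most promising partial path from the queue ... */&lt;br /&gt;
     Tmk_lock_release(0);&lt;br /&gt;
     /* ... extend the path by one node and compute its length ... */&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void update_shortest(int my_length)&lt;br /&gt;
 {&lt;br /&gt;
     Tmk_lock_acquire(1);          /* mutual exclusion on Shortest_length */&lt;br /&gt;
     if (my_length &amp;lt; *Shortest_length)&lt;br /&gt;
         *Shortest_length = my_length;&lt;br /&gt;
     Tmk_lock_release(1);&lt;br /&gt;
 }&lt;br /&gt;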
&lt;br /&gt;
=== Notable DSM Implementations ===&lt;br /&gt;
From an architectural point of view, DSMs are composed of several nodes connected via a network. Each node can be an individual machine or a cluster of machines. Each system has local memory modules that are either partially or completely part of the shared memory. Many characteristics can be used to classify DSM implementations; one of them is the level at which the DSM mechanism is implemented: software, hardware, or hybrid. This historical classification has been extracted from [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 Distributed shared memory: concepts and systems].&lt;br /&gt;
&lt;br /&gt;
Software DSM implementations refer to DSM realized in user-level software, the operating system, the programming language, or a combination of these. &lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Implementation'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Implementation / Cluster configuration'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Network'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Algorithm'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Consistency Model'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Granularity Unit'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Coherence Policy'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''SW/HW/Hybrid'''&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/vm-and-gc/ivy-shared-virtual-memory-li-icpp-1988.pdf IVY]||User-level library + OS modification || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||1 Kbyte ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/121133.121159 Munin]||Runtime system + linker + library + preprocessor + OS modifications ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||Type-specific (SRSW, MRSW, MRMW) ||Release ||Variable size objects ||Type-specific (delayed update, invalidate) ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]||User-level || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRMW ||Lazy release ||4 Kbytes ||Update, Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/74851.74871 Mirage]||OS kernel ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||MRSW ||Sequential ||512 bytes ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://onlinelibrary.wiley.com.prox.lib.ncsu.edu/doi/10.1002/spe.4380210503/pdf Clouds]||OS, out of kernel || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Inconsistent, sequential ||8 Kbytes ||Discard segment when unlocked ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1663305&amp;amp;tag=1 Linda]||Language || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||Variable (tuple size) ||Implementation- dependent ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=93183 Memnet]||Single processor, Memnet device||Token ring||MRSW||Sequential||32 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Scalable_Coherent_Interface SCI]||Arbitrary||Arbitrary||MRSW||Sequential||16 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.8026&amp;amp;rep=rep1&amp;amp;type=pdf KSR1]||64-bit custom PE, I+D caches, 32M local memory||Ring-based||MRSW||Sequential||128 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=766965&amp;amp;isnumber=16621 RMS]||1-4 processors, caches, 256M local memory||RM bus||MRMW||Processor||4 bytes||Update||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Alewife_(multiprocessor) Alewife]||Sparcle PE, 64K cache, 4M local memory, CMMU|| mesh||MRSW||Sequential||16 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://dl.acm.org/citation.cfm?id=192056 Flash]||MIPS T5, I+D caches, Magic controller|| mesh||MRSW||Release||128 bytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s08/papers/10_tempest.pdf Typhoon]||SuperSparc, 2-L caches|| NP controller||MRSW||Custom||32 bytes||Invalidate custom||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://shrimp.cs.princeton.edu/ Shrimp]||16 Pentium PC nodes|| Intel Paragon routing network||MRMW||AURC, scope||4 Kbytes||Update/Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Below is an explanation of the main characteristics listed in the DSM classification.&lt;br /&gt;
 &lt;br /&gt;
There are three types of DSM algorithm: &lt;br /&gt;
* '''Single Reader/ Single Writer''' (SRSW) &lt;br /&gt;
** central server algorithm - produces long network delays &lt;br /&gt;
** migration algorithm - produces thrashing and false sharing&lt;br /&gt;
* '''Multiple Readers/ Single Writer''' (MRSW) - read replication algorithm. It uses write invalidate. MRSW is the most adopted algorithm.&lt;br /&gt;
* '''Multiple Readers/Multiple Writers''' (MRMW) - full replication algorithm.  It has full concurrency and uses atomic updates.&lt;br /&gt;
&lt;br /&gt;
The consistency model plays a fundamental role in DSM systems. Due to the nature of distributed systems, memory accesses are constrained in the different consistency models. &amp;quot;A memory consistency model defines the legal ordering of memory references issued by some processor, as observed by other processors.&amp;quot; The stricter the consistency model, the higher the access times, but programming is simpler. Some of the consistency model types are:&lt;br /&gt;
* Sequential consistency - all processors see the same ordering of memory references, and these are issued in sequences by the individual processors.&lt;br /&gt;
* Processor consistency - the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence. &lt;br /&gt;
* Weak consistency - consistency is required only on synchronization accesses.&lt;br /&gt;
* Release consistency - divides the synchronization accesses into acquire and release. Normal reads and writes for a particular node can be done only after all acquires for that node are finished. Similarly, releases can only be done after all reads and writes are finished (see the sketch after this list). &lt;br /&gt;
* Lazy Release consistency - extends the Release consistency, by propagating modifications to the shared data only on the acquire, and of those, only the ones related to critical sections.&lt;br /&gt;
* Entry consistency - synchronization is performed at variable level. This increases programming labor but helps with lowering latency and traffic exchanges as only the specific variable needs to be synchronized.&lt;br /&gt;
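&lt;br /&gt;
As a hedged illustration of the acquire/release semantics mentioned above for release consistency, the sketch below uses generic ''acquire()''/''release()'' operations (hypothetical names): writes made inside the critical section are guaranteed to be visible to another process only after the writer's release and that process's subsequent acquire of the same synchronization variable.&lt;br /&gt;
&lt;br /&gt;
 /* Hypothetical acquire/release pair; under release consistency the write to&lt;br /&gt;
    result is only guaranteed visible to readers that later acquire lock 0. */&lt;br /&gt;
 void acquire(int lock_id);    /* assumed synchronization primitives */&lt;br /&gt;
 void release(int lock_id);&lt;br /&gt;
 void use(int value);&lt;br /&gt;
 &lt;br /&gt;
 extern int *result;           /* shared data */&lt;br /&gt;
 extern int *ready;            /* shared flag */&lt;br /&gt;
 &lt;br /&gt;
 void producer(void)&lt;br /&gt;
 {&lt;br /&gt;
     acquire(0);&lt;br /&gt;
     *result = 42;             /* writes inside the critical section...           */&lt;br /&gt;
     *ready  = 1;&lt;br /&gt;
     release(0);               /* ...must be visible before the release completes */&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 void consumer(void)&lt;br /&gt;
 {&lt;br /&gt;
     acquire(0);               /* the acquire pulls in writes guarded by lock 0 */&lt;br /&gt;
     if (*ready)&lt;br /&gt;
         use(*result);         /* guaranteed to observe 42 */&lt;br /&gt;
     release(0);&lt;br /&gt;
 }&lt;br /&gt;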
&lt;br /&gt;
Granularity refers to the unit of data blocks that are managed by the coherence protocols. The unit differs between hardware and software systems, as hardware systems tend to use smaller blocks than the virtual layer that manages the data in software systems. The problem with larger blocks is that the probability of contention is higher, even when the different processors involved are not accessing the exact same piece of memory, just a part contained in the same block. This is known as false sharing and creates thrashing (memory blocks keep being requested by processors and processors keep waiting for the same memory blocks).&lt;br /&gt;
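&lt;br /&gt;
A concrete, hypothetical illustration of false sharing: two processes update different variables that merely happen to fall in the same coherence unit, so the unit bounces between them even though no data is actually shared.&lt;br /&gt;
&lt;br /&gt;
 /* Hypothetical false-sharing example: a and b are logically independent, but&lt;br /&gt;
    they sit in the same page or cache block, so each write by one process&lt;br /&gt;
    invalidates (or forces an update of) the other process's copy. */&lt;br /&gt;
 struct counters {&lt;br /&gt;
     long a;      /* only ever written by process 0 */&lt;br /&gt;
     long b;      /* only ever written by process 1 */&lt;br /&gt;
 };               /* both fields easily fit in one coherence unit */&lt;br /&gt;
 &lt;br /&gt;
 void work(struct counters *shared, int my_id)&lt;br /&gt;
 {&lt;br /&gt;
     long i;&lt;br /&gt;
     for (i = 0; i &amp;lt; 1000000; i++) {&lt;br /&gt;
         if (my_id == 0)&lt;br /&gt;
             shared-&amp;gt;a++;      /* invalidates process 1's copy of the unit */&lt;br /&gt;
         else&lt;br /&gt;
             shared-&amp;gt;b++;      /* invalidates process 0's copy of the unit */&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;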
&lt;br /&gt;
Coherence policy regulates data replication. The coherence policy dictates whether data being written at a site should be invalidated or updated at the remote sites. Usually, systems with fine-grain coherence (byte/word) impose the update policy, whereas systems based on coarse-grain (page) coherence use the invalidate policy. This is also known in the literature as the coherence protocol, and the two types of protocols are write-invalidate and write-update. The write-invalidate protocol invalidates all copies except one before writing to it. In contrast, write-update keeps all copies updated.&lt;br /&gt;
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
There are numerous studies of the performance of shared memory applications in distributed systems. The vast majority of them use a collection of programs named [http://dl.acm.org/citation.cfm?id=223990 SPLASH and SPLASH-2.]&lt;br /&gt;
===== SPLASH and SPLASH-2 =====&lt;br /&gt;
The '''Stanford ParalleL Applications for SHared memory''' (SPLASH) is a collection of parallel programs engineered for the evaluation of shared address space machines. These programs have been used in research studies to provide measurements and analysis of different aspects of the DSM architectures emerging at the time. A subsequent suite of programs (SPLASH-2) evolved from the need to improve on the limitations of the SPLASH programs. SPLASH-2 covers a broader domain of scientific programs, makes use of improved algorithms, and pays more attention to the architecture of the underlying systems.&lt;br /&gt;
&lt;br /&gt;
Selected applications in the SPLASH-2 collections include:&lt;br /&gt;
*FFT: a '''Fast Fourier Transform''' implementation, in which the data is organized in source and destination matrices so that each processor stores a contiguous set of rows in its local memory. In this application all processors communicate with one another, sending data to each other, to carry out a matrix transposition.&lt;br /&gt;
*Ocean: calculation of large-scale ocean movements simulating eddy currents. For the purpose of calculation, it shows nearest-neighbor access patterns on a multi-grid formation as opposed to a single grid.&lt;br /&gt;
*LU: decomposition of a matrix into the product of an upper triangular and a lower triangular matrix. LU exhibits a &amp;quot;one-to-many non-personalized communication&amp;quot;.&lt;br /&gt;
*Barnes: simulates the interaction of a group of particles over time steps. &lt;br /&gt;
*Radix sort: an integer sorting algorithm. This implementation exhibits communication among all the processors involved, with irregular communication patterns.&lt;br /&gt;
&lt;br /&gt;
===== Case Study - 2001 - Shan et al. =====&lt;br /&gt;
In 2001, [http://escholarship.org/uc/item/76p9b40g#page-1 Shan et al.] presented a comparison of the performance and programming effort of MP versus SAS running on clusters of '''Symmetric Multiprocessors''' (SMPs). They highlighted the &amp;quot;automatic management and coherent replication&amp;quot; of the SAS programming model, which facilitates the programming tasks in these types of clusters. This study uses the MPI/Pro protocol for the MP programming model and the GeNIMA SVM protocol (a page-based shared virtual memory protocol) for SAS on a 32-processor system (a cluster of 8 machines with 4-way SMPs each). The subset of applications used includes regularly structured applications such as FFT, Ocean, and LU, contrasted with irregular ones such as Radix sort, among others.&lt;br /&gt;
&lt;br /&gt;
The complexity of development is represented by the number of code lines per application, as shown in the table below. SAS complexity is significantly lower than that of MP, and the difference increases as applications become more irregular and dynamic in nature (the line count almost doubles for Radix).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Appl.&lt;br /&gt;
!FFT &lt;br /&gt;
!OCEAN &lt;br /&gt;
!LU &lt;br /&gt;
!RADIX &lt;br /&gt;
!SAMPLE &lt;br /&gt;
!N-BODY&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MPI ||222||4320||470||384||479||1371&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SAS ||210 ||2878 ||309||201||450||950&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Performance-wise, the results indicated that SAS achieved only about half the parallel efficiency of MP for most of the applications in this study. The only application that showed similar performance for both methods (MP and SAS) was LU. The authors concluded that the overhead of the SVM protocol for maintaining page coherence and synchronization was the handicap of the easier SAS programming model.&lt;br /&gt;
&lt;br /&gt;
===== Case Study - 2004 - Iosevich and Schuster =====&lt;br /&gt;
&lt;br /&gt;
In 2004, [http://dl.acm.org/citation.cfm?id=1006252 Iosevich and Schuster] performed a study on two memory consistency models in a DSM, the '''sequential consistency''' (SC) model and a relaxed consistency model called '''home-based lazy release consistency''' (HLRC) protocol. The SC provides a less complex programming model, whereas the HLRC improves on running performance as it allows parallel memory access. Memory consistency models provide a specific set of rules for interfacing with memory.&lt;br /&gt;
&lt;br /&gt;
The authors used a [http://www.cs.uga.edu/~dkl/6730/Fall02/Readings/millipage.pdf multiview] (MV) technique to ensure an efficient implementation of SC with fine-grain access to memory. The main advantage of this technique is that, by mapping one physical region to several virtual regions, the system avoids fetching the whole content of the physical page when a fault occurs while accessing a specific variable located in one of the virtual regions. One step further is the mechanism proposed by [http://onlinelibrary.wiley.com/doi/10.1002/spe.417/abstract Niv and Schuster] to change the granularity dynamically at runtime.&lt;br /&gt;
For SC (with MV), only the page replicas need to be tracked, whereas for HLRC the tracking needed is much more complex. The advantage of HLRC is that write faults on read-only pages are local, resulting in a lower cost for these operations.&lt;br /&gt;
&lt;br /&gt;
This table summarizes the characteristics of the benchmark applications used for the measurements. In the Synch column, B represents barriers and  L locks.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Application&lt;br /&gt;
! Input data set&lt;br /&gt;
! Shared memory&lt;br /&gt;
!Sharing granularity&lt;br /&gt;
! Synch&lt;br /&gt;
!Allocation pattern&lt;br /&gt;
|-&lt;br /&gt;
| Water-nsq|| 8000 molecules|| 5.35MB|| a molecule (672B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Water-sp|| 8000 molecules|| 10.15MB|| a molecule (680B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| LU|| 3072 × 3072|| 72.10MB|| block (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| FFT|| 2^20 numbers|| 48.25MB|| a row segment|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| TSP|| A graph of 32 cities|| 27.86MB|| a tour (276B)|| L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| SOR|| 2066 × 10240|| 80.73MB|| a row (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Barnes-sp|| 32768 bodies|| 41.21MB|| body fields (4-32B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Radix|| 10240000 keys|| 82.73MB|| an integer (4B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Volrend|| a file -head.den-|| 29.34MB|| a 4 × 4 box (4B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Ocean|| a 514 × 514 grid|| 94.75MB|| grid point (8B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| NBody|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|-&lt;br /&gt;
| NBodyW|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The authors found that the average speedup of the HLRC protocol is 5.97, and the average speedup of the SC (with MV) protocol is 4.5 if non-optimized allocation is used, and 6.2 if the granularity is changed dynamically.&lt;br /&gt;
&lt;br /&gt;
===== Case Study - 2008 - Roy and Chaudhary =====&lt;br /&gt;
&lt;br /&gt;
In 2008, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.1319 Roy and Chaudhary] compared the communication requirements of three different page-based DSM systems ([http://www.cs.umd.edu/projects/cvm/ CVM], [http://www.cs.utah.edu/flux/quarks.html Quarks], and [http://ieeexplore.ieee.org/iel4/5737/15339/00709960.pdf?arnumber=709960 Strings]) that use virtual memory mechanisms to trap accesses to shared areas. Their study was also based on the SPLASH-2 suite of programs.&lt;br /&gt;
In their experiments, Quarks was running tasks on separate processes (it does not support more than one application thread per process) while CVM and Strings were run with multiple application threads.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Program &lt;br /&gt;
!CVM &lt;br /&gt;
!Quarks &lt;br /&gt;
!Strings&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| FFT||1290||2419||1894&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-c||135||-||485&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-n||385||2873||407&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| OCEAN-c||1955||15475||6676&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-n2||2253||38438||10032&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-sp||905||7568||1998&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MATMULT||290||1307||645&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SOR||247||7236||934&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The results indicate that Quarks generates a large quantity of messages in comparison with the other two systems, as can be observed in the table above. LU-c (the contiguous version of LU) does not have a result for Quarks, as it was not possible to obtain results for the number of tasks (16) used in the experiment, due to the excessive number of collisions. In general, because of the lock-related traffic, the performance of Quarks is quite low compared to the other two systems for many of the application programs tested.&lt;br /&gt;
CVM improves on lock management by allowing out-of-order access through a centralized lock manager. This makes the locking times for CVM much smaller than for the others. &lt;br /&gt;
Nevertheless, the best performer is Strings. It wins in overall performance over the other two systems compared, but there is room for improvement here too, as the locking times in Strings were observed to be a large percentage of the overall computation. &lt;br /&gt;
&lt;br /&gt;
The overall conclusion is that using multiple threads per node and optimized locking mechanisms (CVM-type) provides the best performance.&lt;br /&gt;
&lt;br /&gt;
=== Evolution ===&lt;br /&gt;
A more recent example of a distributed shared memory system is vNUMA.  [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html Chapman and Heiser] describe vNUMA (where v stands for virtual and NUMA for [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non Uniform Memory Access]) as &amp;quot;a virtual machine that presents a cluster as a virtual shared-memory multiprocessor.&amp;quot; The virtualization in vNUMA is a layer between the real CPUs that form the distributed shared memory system and the OS that runs on top of it, usually Linux. The DSM in vNUMA is part of the hypervisor, the layer that maps the guest's physical addresses onto the real machine memory spread across the nodes; the guest OS, in turn, maps its virtual addresses onto those guest physical addresses. &lt;br /&gt;
&lt;br /&gt;
The difference from other virtual machines is that vNUMA runs one OS on top of several physical machines, whereas virtual machines typically run several guest OSes on top of one host OS running on a single machine. vNUMA uses DSM software techniques to present all the virtualized memory as a whole.&lt;br /&gt;
&lt;br /&gt;
The DSM in vNUMA is a single-writer/multiple-reader write-invalidate protocol with sequential consistency. It is based on the IVY DSM, but it introduces several improvements to increase performance. The owner of a page can determine whether the page needs to be sent by looking at the copyset (the set of nodes that maintain a copy of the page), avoiding several page faults, and the manager becomes the owner as soon as it is part of the copyset. There are a couple of other improvements: ''incremental deterministic merging'' and ''write-update-plus (WU+)''. ''Incremental deterministic merging'' uses sequence numbers to ensure that a location gets updated with the latest value and not with intermediate, out-of-order writes. ''Write-update-plus (WU+)'' enforces single-writer behavior for pages where atomic operations are performed; vNUMA dynamically changes from multiple-writer to single-writer when atomic operations are detected. &lt;br /&gt;
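&lt;br /&gt;
A hedged sketch of the copyset check described above: before transferring a page, the owner consults the copyset to see whether the requester already holds a copy. The data layout and function names are assumptions, not vNUMA's actual implementation.&lt;br /&gt;
&lt;br /&gt;
 /* Assumed illustration of a copyset check on the page owner. */&lt;br /&gt;
 #define MAX_NODES 32&lt;br /&gt;
 &lt;br /&gt;
 struct page_meta {&lt;br /&gt;
     int  owner;&lt;br /&gt;
     char copyset[MAX_NODES];          /* nodes known to hold a copy */&lt;br /&gt;
 };&lt;br /&gt;
 &lt;br /&gt;
 void send_page(int node, long page);         /* assumed transfer primitives */&lt;br /&gt;
 void send_grant_only(int node, long page);&lt;br /&gt;
 &lt;br /&gt;
 void serve_read_request(struct page_meta *p, long page, int requester)&lt;br /&gt;
 {&lt;br /&gt;
     if (p-&amp;gt;copyset[requester]) {&lt;br /&gt;
         send_grant_only(requester, page);    /* requester already has the data */&lt;br /&gt;
     } else {&lt;br /&gt;
         send_page(requester, page);          /* ship the page contents */&lt;br /&gt;
         p-&amp;gt;copyset[requester] = 1;           /* remember the new copy holder */&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;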
&lt;br /&gt;
In [http://communities.vmware.com/community/vmtn/cto/high-performance/blog/2011/09/19/vnuma-what-it-is-and-why-it-matters vNUMA what it is and why it matters], VMware presents vNUMA as part of vSphere, a virtualization platform oriented to build cloud computing frameworks.&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
*[http://www.sgi.com/pdfs/4250.pdf Performance and Productivity Breakthroughs with Very Large Coherent Shared Memory: The SGI® UV Architecture] SGI white paper.&lt;br /&gt;
*[http://www.scalemp.com/architecture#2 Versatile SMP (vSMP) Architecture]&lt;br /&gt;
*Adrian Moga, Michel Dubois, [http://www.sciencedirect.com/science/article/pii/S1383762108001136 &amp;quot;A comparative evaluation of hybrid distributed shared-memory systems,&amp;quot;] Journal of Systems Architecture, Volume 55, Issue 1, January 2009, Pages 43-52&lt;br /&gt;
*Jinbing Peng; Xiang Long; Limin Xiao; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=4708970&amp;amp;isnumber=4708921 &amp;quot;DVMM: A Distributed VMM for Supporting Single System Image on Clusters,&amp;quot;] Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for , vol., no., pp.183-188, 18-21 Nov. 2008&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
*Shan, H.; Singh, J.P.; Oliker, L.; Biswas, R.; , [http://escholarship.org/uc/item/76p9b40g#page-1 &amp;quot;Message passing vs. shared address space on a cluster of SMPs,&amp;quot;] Parallel and Distributed Processing Symposium., Proceedings 15th International , vol., no., pp.8 pp., Apr 2001&lt;br /&gt;
*Protic, J.; Tomasevic, M.; Milutinovic, V.; , [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 &amp;quot;Distributed shared memory: concepts and systems,&amp;quot;] Parallel &amp;amp; Distributed Technology: Systems &amp;amp; Applications, IEEE , vol.4, no.2, pp.63-71, Summer 1996&lt;br /&gt;
*Chandola, V. , [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.5206&amp;amp;rep=rep1&amp;amp;type=pdf &amp;quot;Design Issues in Implementation of Distributed Shared Memory in User Space,&amp;quot;]&lt;br /&gt;
*Nitzberg, B.; Lo, V. , [http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf &amp;quot;Distributed Shared Memory:  A Survey of Issues and Algorithms&amp;quot;]&lt;br /&gt;
*Steven Cameron Woo , Moriyoshi Ohara , Evan Torrie , Jaswinder Pal Singh , Anoop Gupta, [http://dl.acm.org/citation.cfm?id=223990 &amp;quot;The SPLASH-2 programs: characterization and methodological considerations,&amp;quot;] Proceedings of the 22nd annual international symposium on Computer architecture, p.24-36, June 22-24, 1995, S. Margherita Ligure, Italy&lt;br /&gt;
*Jegou, Y. , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Implementation of page management in Mome, a user-level DSM] Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on, p.479-486, 21 May 2003, IRISA/INRIA, France&lt;br /&gt;
*Hennessy, J.; Heinrich, M.; Gupta, A.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=747863 &amp;quot;Cache-Coherent Distributed Shared Memory: Perspectives on Its Development and Future Challenges,&amp;quot;] Proceedings of the IEEE, Volume: 87 Issue:3, pp.418 - 429, Mar 1999, Comput. Syst. Lab., Stanford Univ., CA&lt;br /&gt;
*J. Protic, M. Tomasevic, and V. Milutinovic, [http://media.wiley.com/product_data/excerpt/76/08186773/0818677376-2.pdf An Overview of Distributed Shared Memory]&lt;br /&gt;
*Yoon, M.; Malek, M.; , [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf &amp;quot;Configurable Shared Virtual Memory for Parallel Computing&amp;quot;] University of Texas Technical Report tr94-21, July 15 1994, Department of Electrical and Computer Engineering, The University of Texas at Austin&lt;br /&gt;
*Dubnicki, C.;   Iftode, L.;   Felten, E.W.;   Kai Li;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=508084 &amp;quot;Software Support for Virtual Memory-Mapped Communication&amp;quot;] Parallel Processing Symposium, 1996., Proceedings of IPPS '96, The 10th International, pp.372 - 381, 15-19 Apr 1996, Dept. of Comput. Sci., Princeton Univ., NJ &lt;br /&gt;
*Dubnicki, C.;   Bilas, A.;   Li, K.;   Philbin, J.;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=580931 &amp;quot;Design and Implementation of Virtual Memory-Mapped Communication on Myrinet&amp;quot;] Parallel Processing Symposium, 1997. Proceedings., 11th International, pp.388 - 396, 1-5 Apr 1997, Princeton Univ., NJ&lt;br /&gt;
*Kranz, D.; Johnson, K.; Agarwal, A.; Kubiatowicz, J.; Lim, B.; , [http://delivery.acm.org.prox.lib.ncsu.edu/10.1145/160000/155338/p54-kranz.pdf?ip=152.1.24.251&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=81831968&amp;amp;CFTOKEN=62928147&amp;amp;__acm__=1327853249_029db7e958cb50bd47056f939f3296f7 &amp;quot;Integrating Message-Passing and Shared-Memory:  Early Experience&amp;quot;] PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp.54 - 63&lt;br /&gt;
*Amza, C.;  Cox, A.L.;  Dwarkadas, S.;  Keleher, P.;  Honghui Lu;  Rajamony, R.;  Weimin Yu;  Zwaenepoel, W.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 &amp;quot;TreadMarks: shared memory computing on networks of workstations&amp;quot;] Computer, Volume: 29 Issue:2, pp. 18 - 28, Feb 1996, Dept. of Comput. Sci., Rice Univ., Houston, TX&lt;br /&gt;
*David R. Cheriton. 1985. [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems.] SIGOPS Oper. Syst. Rev. 19, 4 (October 1985), 26-33.&lt;br /&gt;
*Matthew Chapman and Gernot Heiser. 2009. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html vNUMA: a virtual shared-memory multiprocessor.] In Proceedings of the 2009 conference on USENIX Annual technical conference (USENIX'09). USENIX Association, Berkeley, CA, USA, 2-2.&lt;br /&gt;
&lt;br /&gt;
*Protic, Jelica; Tomasevic, Milo; Milutinovic, Veljko; , [http://www.cs.rit.edu/~pns6910/docs/Distributed%20Shared%20Memory%20Systems/A%20survey%20of%20distributed%20shared%20memory%20systems.pdf &amp;quot;A Survey of Distributed Shared Memory Systems,&amp;quot;] 28th Annual Hawaii International Conference on System Sciences, IEEE, Hawaii, 1995.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
The memory hierarchy described for the CSVS system places remote memories:&lt;br /&gt;
# Between main memory and local disk storage&lt;br /&gt;
# Same hierarchy as local disk storage&lt;br /&gt;
# Below local disk storage&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When messages are sent by the OS to retrieve non-local data the virtual address of the retrieved data is translated to physical:&lt;br /&gt;
# At the origin of the message, i.e. where the page fault occurs&lt;br /&gt;
# By the DSM system default manager&lt;br /&gt;
# At the location where the desired page resides&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
DSM nodes&lt;br /&gt;
# partially map variable amounts of their memory to the distributed address space&lt;br /&gt;
# are configured to supply a contiguous and fixed amount of memory to the distributed address space&lt;br /&gt;
# utilize I/O to access the entirely non-local distributed address space&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SAS programming model:&lt;br /&gt;
# Has evolved beyond MP as it is difficult to program in scalable DSM environments&lt;br /&gt;
# Utilizes MP to communicate but relies on the ease of a common address space&lt;br /&gt;
# Has suffered too many security problems, scalable MP now dominates the landscape&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Page management in MOME:&lt;br /&gt;
# Requires consistent address space mapping across all nodes&lt;br /&gt;
# Is managed from a global DSM perspective&lt;br /&gt;
# Allows an F and V page descriptor to occur for the same page on the same node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The most adopted DSM algorithm is:&lt;br /&gt;
# Single Reader/ Single Writer (SRSW)&lt;br /&gt;
# Multiple Readers/ Single Writer (MRSW)&lt;br /&gt;
# Multiple Readers/Multiple Writers (MRMW)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Sequential Consistency:&lt;br /&gt;
# all processors see the same ordering of memory references, and these are issued in sequences by the individual processors&lt;br /&gt;
# the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence&lt;br /&gt;
# consistency is required only on synchronization accesses&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
SPLASH is a&lt;br /&gt;
# coherence protocol&lt;br /&gt;
# collection of parallel programs engineered for the evaluation of shared address space machines&lt;br /&gt;
# DSM implementation&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The complexity of development is &lt;br /&gt;
# the same for MP and SAS&lt;br /&gt;
# lower for SAS&lt;br /&gt;
# lower for MP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
vNUMA is a&lt;br /&gt;
# Fast Fourier Transform implementation&lt;br /&gt;
# network implementation&lt;br /&gt;
# virtual machine that presents a cluster as a virtual shared-memory multiprocessor&lt;/div&gt;</summary>
		<author><name>Mchen4</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/2a_lm&amp;diff=73040</id>
		<title>CSC/ECE 506 Spring 2013/2a lm</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2013/2a_lm&amp;diff=73040"/>
		<updated>2013-02-15T20:36:16Z</updated>

		<summary type="html">&lt;p&gt;Mchen4: Updated background section&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Dsm.jpg|300px|thumb|left|SCD's IBM SP system blackforest, a distributed shared memory ('''DSM''') system]]&lt;br /&gt;
&lt;br /&gt;
== SAS programming on distributed-memory machines ==&lt;br /&gt;
[https://docs.google.com/a/ncsu.edu/document/d/1898MW7jXRhuz40HXXiTsobSUDdUVBZ-aUjEyLdeQdNc/edit#, Topic Writeup]&lt;br /&gt;
&lt;br /&gt;
[http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/2a_bm Original Page]&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Shared_memory '''Shared Address Space'''] (SAS) programming on distributed memory machines is a programming abstraction that provides less development effort than that of the traditional method of [http://en.wikipedia.org/wiki/Message_passing '''Message Passing'''] (MP) on distributed memory machines, such as clusters of servers.  Distributed systems are groups of computers that communicate through a network and share a common work goal.  Distributed systems typically do not physically share the same memory (are not [http://en.wikipedia.org/wiki/Coupling_%28computer_programming%29 '''tightly coupled''']) but rather each processor or group of processors must depend on mechanisms other than direct memory access the shared memory, this arrangement is called Distributed Shared Memory and is discussed below.  Relevant issues that come to bear include [http://en.wikipedia.org/wiki/Memory_coherence '''memory coherence'''], types of memory access, data and process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''], and performance.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
Distributed memory systems are multi-processor systems in which each processor has its own individual memory. Tasks can only operate on a processor's local memory and if non-local data is required, the processor must communicate with one or more remote processors. Distributed memory systems started to flourish in the 1980s. The increasing performance in processors and network connectivity offered the perfect environment for parallel processing over a network of computers. This was a cheap way to put together massive computing power. The main drawback was going from sequential programs made for local memory to parallel programming in shared memory. This was where SAS provided the means to simplify programming by hiding the mechanisms to access distant memory located in other computers of the cluster.&lt;br /&gt;
&lt;br /&gt;
In 1985, Cheriton, in his article [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 &amp;quot;Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems&amp;quot;], introduced ideas for the application of shared memory techniques in distributed memory systems. Cheriton envisioned a system of nodes with a pool of shared memory using a common file namespace that could &amp;quot;decentralize the implementation of a service.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Early distributed computer systems relied almost exclusively on message passing in order to communicate with one another, and this technique is still widely used today.  In a message passing (MP) model, each processor's local memory can be considered as isolated from that of the rest of the system.  Processes or objects can send or receive messages in order to communicate, and this can occur in a synchronous or asynchronous manner.  In distributed systems, and particularly with certain types of programs, the message passing model can become overly burdensome to the programmer as tracking data movement and maintaining data integrity can become challenging with many control threads.  A shared address or shared-memory system, however, can provide a programming model that simplifies data sharing via uniform mechanisms of data structure reads and writes on common memory.  Current distributed systems seek to take advantage both SAS and MP programming model principles in hybrid systems.&lt;br /&gt;
&lt;br /&gt;
=== Distributed Shared Memory (DSM) ===&lt;br /&gt;
[[File:Dsmd.jpg|400px|thumb|right|Distributed Shared Memory]]&lt;br /&gt;
Most commonly, a distributed system utilizing SAS will consist of a set of nodes connected by a network.  Nodes may be comprised of individual processors or a multiprocessor system (e.g. [http://en.wikipedia.org/wiki/Symmetric_multiprocessing '''Symmetric Multiprocessor'''] (SMP)), the latter typically sharing a system bus.  Each node itself contains a local memory, which maps partially to the distributed address space.  Relevant design elements of early SAS implementations included scalability, coherence, structure and granularity.  Most early examples did not structure memory, that is the layout of shared memory was simply a linear array of words.  Some, however, structured data as objects or language types.  '''IVY''' , an early example of a DSM system, implemented shared memory as virtual memory.  The granularity, or unit share size, for IVY was in 1-Kbyte pages and the memory was unstructured.  A problem when considering optimal page size is the balance between a process likely needing quick access to a large range of the shared address space, which argues for a larger page size, countered by the greater contention for individual pages that the larger page may cause amongst processes and the [http://en.wikipedia.org/wiki/False_sharing '''false sharing'''] it may lead to.  [http://en.wikipedia.org/wiki/Memory_coherence Memory coherence] is another important design element consideration, and semantics can be instituted that run gradations of strict to weak consistencies.  The strictest consistency guarantees that a read returns the most recently written value.  Weaker consistencies may use synchronization operations to guarantee sequential consistency.&lt;br /&gt;
&lt;br /&gt;
==== Cache-Coherent DSM ====&lt;br /&gt;
&lt;br /&gt;
Early DSM systems implemented a shared address space where the amount of time required to access a piece of data was &lt;br /&gt;
related to its location.  These systems became known as [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access '''Non-Uniform Memory Access'''] (NUMA), whereas an SMP type&lt;br /&gt;
system is known as [http://en.wikipedia.org/wiki/Uniform_Memory_Access '''Uniform Memory Access'''] (UMA) architecture.  NUMA architectures were difficult to program due &lt;br /&gt;
to potentially significant differences in data access times. SMP architectures dealt with this problem through caching. &lt;br /&gt;
Protocols were established that ensured prior to writing a location, all other copies of the location (in other caches)&lt;br /&gt;
were invalidated.  These protocols do not scale to DSM machines and different approaches are necessary.&lt;br /&gt;
&lt;br /&gt;
Cache-coherent DSM architectures rely on a directory-based [http://en.wikipedia.org/wiki/Cache_coherency '''cache coherence'''] protocol where an extra directory structure keeps track of all blocks that have been cached by each processor.  A coherence protocol can then establish a consistent view of &lt;br /&gt;
memory by maintaining state and other information about each cached block.  These states usually minimally include Invalid,&lt;br /&gt;
Shared, and Exclusive.  Furthermore, in a cache-coherent DSM machine, the directory is distributed in memory to associate&lt;br /&gt;
with the cache block it describes in the physical local memory.&lt;br /&gt;
&lt;br /&gt;
==== User-level DSM ====&lt;br /&gt;
[[File:Untitled_Project.jpg|350px|thumb|left|Memory Mapping in Mome]]&lt;br /&gt;
&lt;br /&gt;
Another form of SAS is a User-level DSM system. In this arrangement, shared memory does not exist until defined by the programmer. Through explicit commands, segments of a processor's private memory become mapped and available as shared memory. &lt;br /&gt;
&lt;br /&gt;
An in depth example of a user-level DSM system is [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Mome]. Mome, in 2003, was a run-time model that mapped Mome segments onto node private address space.    &lt;br /&gt;
&lt;br /&gt;
===== Mome Segment creation =====&lt;br /&gt;
&lt;br /&gt;
Segment creation was initiated through a ''MomeCreateSegment(size)'' call which returned an identifier for mapping used by all nodes.  Any process can request for a mapping of a section of its local memomy to a Mome segment section by calling ''MomeMap(Addr, Lg, Prot, Flags, Seg, Offset)'', which returns the starting address of the mapped region.  Each mapping request made by a process is independent and the addresses of the mappings may or may not be consistent on all nodes.  If mappings are consistent between processes, however, then pointers may be shared by them.  Mome supports strong and weak consistency models, and for any particular page each node is able to dynamically manage its consistency during program execution.&lt;br /&gt;
&lt;br /&gt;
===== Page Management in Mome =====&lt;br /&gt;
&lt;br /&gt;
Mome manages [http://en.wikipedia.org/wiki/Page_%28computer_memory%29 '''pages'''] in a directory based scheme where each page directory maintains the status of six characteristics per page on each node.  The page manager acts upon collections of nodes according to these characteristics for each page:  &lt;br /&gt;
V nodes posses the current version, M nodes have a modified version, S nodes want strong consistency, I nodes are &lt;br /&gt;
invalidated, F nodes have initiated a modification merge and H nodes are a special type of hidden page.  A new version of &lt;br /&gt;
a page is created prior to a constraint violation and before modifications are integrated as a result of a consistency&lt;br /&gt;
request.  &lt;br /&gt;
&lt;br /&gt;
===== Memory mapping in Mome =====&lt;br /&gt;
&lt;br /&gt;
The Mome memory mapping figure to the left shows a possible DSM memory organization on a single node.  The DSM memory size&lt;br /&gt;
shown is 22 pages.  When a new segment is created on a node a segment descriptor is created on that node.  In this case the&lt;br /&gt;
segment descriptor is 12 pages, with each segment descriptor block corresponding to one page.  Each block also contains&lt;br /&gt;
three DSM memory references for current, modified and next version of pages.  The memory organization state shows an &lt;br /&gt;
application with two mappings, M1 and M2, with segment offsets at 0 and 8.  The six pages of M1 are managed by segment &lt;br /&gt;
descriptor blocks 0 to 5.  The descriptor blocks (and application memory) show that pages 1, 2, and 5 have no associated &lt;br /&gt;
memory, while M1 page 0 is mapped to block 6 as a current version and M1 page 3 is mapped to block 13 as a current version,&lt;br /&gt;
block 8 as a modified version, and has initiated a modifications merge as indicated by the block 17 pointer.  The communication&lt;br /&gt;
layer manages incoming messages from other nodes. &lt;br /&gt;
&lt;br /&gt;
[[File:Mem hierarchy.png|200px|thumb|right|Memory hierarchy of node]]&lt;br /&gt;
&lt;br /&gt;
==== Configurable Shared Virtual Space ====&lt;br /&gt;
As described by [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf Yoon et al.] in 1994, the communication paradigm &lt;br /&gt;
for their DSM node network relies on a memory hierarchy in which each node places remote memories at the same level as &lt;br /&gt;
its own local disk storage.  [http://en.wikipedia.org/wiki/Page_fault '''Page faults'''] within a given node that can be resolved from disk storage are handled &lt;br /&gt;
normally, while those that cannot are resolved between the node's main memory and the memory of other nodes. Point-to-point &lt;br /&gt;
communication at the node level is supported through message passing, and the specific communication mechanism is&lt;br /&gt;
agreed upon by all nodes. &lt;br /&gt;
&lt;br /&gt;
Yoon describes a DSM system that generates a shared virtual memory on a per-job basis.  A '''configurable shared virtual address space'''&lt;br /&gt;
(CSVS) is prepared when a member node receives a job: the node generates a job identification number and creates&lt;br /&gt;
an information table in its memory:&lt;br /&gt;
&lt;br /&gt;
                                     ''JOB_INFORMATION {''&lt;br /&gt;
                                         ''status;''&lt;br /&gt;
                                         ''number_of_tasks;''&lt;br /&gt;
                                         ''number_of_completed_tasks;''&lt;br /&gt;
                                         ''*member_list;''                    /*pointer to first member*/&lt;br /&gt;
                                         ''number_of_members;''&lt;br /&gt;
                                         ''IO_server;''&lt;br /&gt;
                                     ''}''&lt;br /&gt;
&lt;br /&gt;
The ''status'' refers to the creation state of the CSVS, and ''number_of_members'' and ''member_list'' are established through&lt;br /&gt;
a task distribution process during address space assignment.  All tasks associated with the program are tagged with&lt;br /&gt;
the ''job_id'' and ''requester_id'' and, following address space assignment, are distributed across the system.  The&lt;br /&gt;
actual CSVS creation occurs when the first task of a job is initiated by a member, which requests the generation of the new&lt;br /&gt;
CSVS from all other members.  Subspace assignment for the SAS model then proceeds under the specific ''job_id''.&lt;br /&gt;
&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Operating_system '''operating system'''] (OS) or [http://en.wikipedia.org/wiki/Memory_management_unit '''memory management unit'''] (MMU) of each member maintains a copy of the ''JOB_INFORMATION'' &lt;br /&gt;
table, which is consulted to identify the default manager when a page fault occurs.  When a page fault does occur, the MMU&lt;br /&gt;
locates the default manager and, if the page lies within the member's own subspace, handles the fault normally.  If the requested page is outside its subspace, then the &lt;br /&gt;
virtual address, ''job_id'', and default manager identification are sent to the [http://en.wikipedia.org/wiki/Control_unit '''control unit'''] (CU) to construct a &lt;br /&gt;
message requesting a page copy.  All messages sent through the CSVS must include a virtual address and the ''job_id'',&lt;br /&gt;
which acts as protection to control access to relevant memory locations.  When received at the appropriate member&lt;br /&gt;
node, the virtual address is translated to a local physical address.&lt;br /&gt;
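&lt;br /&gt;
The fields that such a page-copy request must carry, as described above, can be sketched in C as follows; the structure name and layout are hypothetical, since the paper does not prescribe a concrete message format.&lt;br /&gt;
&lt;br /&gt;
  /* Hypothetical layout of a CSVS page-copy request built by the CU.              */&lt;br /&gt;
  struct csvs_page_request {&lt;br /&gt;
      unsigned long virtual_address;   /* faulting address within the CSVS         */&lt;br /&gt;
      int           job_id;            /* guards access to the job's memory        */&lt;br /&gt;
      int           default_manager;   /* member that manages the faulting page    */&lt;br /&gt;
      int           requester_id;      /* member on which the page fault occurred  */&lt;br /&gt;
  };&lt;br /&gt;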
[[File:Jacobi_code.jpg|300px|thumb|left|Jacobi method pseudocode using TreadMarks API]]&lt;br /&gt;
&lt;br /&gt;
=====Improvements in communication=====&lt;br /&gt;
Early SAS programming models in DSM environments suffered from poor performance because protection schemes required&lt;br /&gt;
applications to access the network via system calls, significantly increasing latency.  Later software&lt;br /&gt;
systems and network interfaces arose that were able to ensure safety without incurring the time cost of the system calls.  Addressing this and other &lt;br /&gt;
latency sources on both ends of communication was an important goal for projects such as the '''Virtual Memory-Mapped Communication''' (VMMC) model developed as part of the [http://shrimp.cs.princeton.edu/index.html Shrimp Project]. &lt;br /&gt;
&lt;br /&gt;
Protection is achieved in VMMC because the receiver must grant permission before the sender is allowed to transfer data&lt;br /&gt;
to a receiver-defined area of its address space.  In this communication scheme, the receiver process exports areas of its&lt;br /&gt;
address space that will act as receive buffers, and sending processes must import these destinations.  There is no explicit &lt;br /&gt;
receive operation in VMMC.  Receivers are able&lt;br /&gt;
to define which senders can import specific buffers, and VMMC ensures that only receive buffer space is overwritten.  Imported&lt;br /&gt;
receive buffers are mapped to a destination proxy space, which can be implemented as part of the sender's virtual address&lt;br /&gt;
space and can be translated by VMMC to a receiving node, process, and memory address.  VMMC supports a deliberate update&lt;br /&gt;
request that updates data sent previously to an imported receive buffer.  This transfer occurs directly, without interrupting the receiver's&lt;br /&gt;
CPU.&lt;br /&gt;
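&lt;br /&gt;
The export/import/transfer flow described above can be sketched as follows.  The function names and signatures here are illustrative assumptions only, not the actual VMMC interface.&lt;br /&gt;
&lt;br /&gt;
  /* Hypothetical sketch of the VMMC flow described above.  The function names     */&lt;br /&gt;
  /* and signatures are illustrative assumptions, not the actual VMMC interface.   */&lt;br /&gt;
  typedef int vmmc_buf_id;&lt;br /&gt;
  extern vmmc_buf_id vmmc_export(void *buf, unsigned long len, int allowed_sender);&lt;br /&gt;
  extern void *vmmc_import(int receiver_node, vmmc_buf_id id);  /* returns proxy address */&lt;br /&gt;
  extern void  vmmc_send(void *proxy, const void *src, unsigned long len);&lt;br /&gt;
  &lt;br /&gt;
  enum { BUF_SIZE = 4096, SENDER_NODE = 1, RECEIVER_NODE = 0 };&lt;br /&gt;
  static char recv_buffer[BUF_SIZE];&lt;br /&gt;
  &lt;br /&gt;
  vmmc_buf_id receiver_side(void)&lt;br /&gt;
  {   /* Receiver exports a buffer and grants import rights to a single sender;    */&lt;br /&gt;
      /* no explicit receive operation is ever posted.                             */&lt;br /&gt;
      return vmmc_export(recv_buffer, BUF_SIZE, SENDER_NODE);&lt;br /&gt;
  }&lt;br /&gt;
  &lt;br /&gt;
  void sender_side(vmmc_buf_id id, const char *data)&lt;br /&gt;
  {   /* Sender imports the buffer into its destination proxy space and transfers  */&lt;br /&gt;
      /* data; the transfer completes without interrupting the receiver's CPU.     */&lt;br /&gt;
      void *proxy = vmmc_import(RECEIVER_NODE, id);&lt;br /&gt;
      vmmc_send(proxy, data, BUF_SIZE);&lt;br /&gt;
  }&lt;br /&gt;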
&lt;br /&gt;
[[File:Shortest_path_pseudocode.jpg|300px|thumb|right|Shortest path pseudocode using TreadMarks API]]&lt;br /&gt;
&lt;br /&gt;
=== Programming Environment ===&lt;br /&gt;
The globally shared memory abstraction provided through virtual memory or some other DSM mechanism allows programmers &lt;br /&gt;
to focus on algorithms instead of processor communication and data tracking.  Many programming environments have been&lt;br /&gt;
developed for DSM systems including Rice University's [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]&lt;br /&gt;
in the 1990s.  TreadMarks was a user-level library that ran on top of Unix.  Programs were written in&lt;br /&gt;
C, C++ or Fortran and then compiled and linked with the TreadMarks library.  &lt;br /&gt;
&lt;br /&gt;
Shown at left is a pseudocode example of using the TreadMarks API to implement the Jacobi method, a type of partial &lt;br /&gt;
differential equation solver.  The code iterates over a 2D array and updates each element to the average of its four&lt;br /&gt;
nearest neighbors.  All processors are assigned an approximately equivalent number of rows and neighboring processes &lt;br /&gt;
share boundary rows as necessary for the calculation.  This example shows TreadMarks' use of [http://en.wikipedia.org/wiki/Barrier_%28computer_science%29 '''barriers'''], a technique used for process [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 '''synchronization'''].  Barriers prevent race&lt;br /&gt;
conditions.  ''void Tmk_startup(int argc, char **argv)'' initializes TreadMarks and starts the remote processes.  &lt;br /&gt;
The ''void Tmk_barrier(unsigned id)'' call blocks the calling process until every other process arrives at the barrier.  In this&lt;br /&gt;
example, ''Tmk_barrier(0)'' guarantees that process 0 completes initialization before any process proceeds, ''Tmk_barrier(1)'' &lt;br /&gt;
guarantees all previous iteration values are read before any current iteration values are written, and ''Tmk_barrier(2)''&lt;br /&gt;
guarantees all current iteration values are written before any next iteration computation begins.&lt;br /&gt;
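&lt;br /&gt;
A condensed C sketch of the Jacobi kernel using the two calls named above is shown below; the extern declarations for the process id and process count, the array sizes, and the row partitioning are simplifying assumptions, and the original TreadMarks example differs in detail.&lt;br /&gt;
&lt;br /&gt;
  /* Condensed Jacobi sketch using the TreadMarks calls named in the text.          */&lt;br /&gt;
  extern void Tmk_startup(int argc, char **argv);&lt;br /&gt;
  extern void Tmk_barrier(unsigned id);&lt;br /&gt;
  extern int  Tmk_proc_id, Tmk_nprocs;     /* assumed to be exported by the library  */&lt;br /&gt;
  &lt;br /&gt;
  #define M 1024&lt;br /&gt;
  #define N 1024&lt;br /&gt;
  #define ITERS 100&lt;br /&gt;
  static float grid[M][N];      /* in a real program this array lives in shared memory */&lt;br /&gt;
  static float scratch[M][N];   /* private per-process scratch array                   */&lt;br /&gt;
  &lt;br /&gt;
  int main(int argc, char **argv)&lt;br /&gt;
  {&lt;br /&gt;
      Tmk_startup(argc, argv);                  /* initialize and start remote processes */&lt;br /&gt;
      int rows  = (M - 2) / Tmk_nprocs;         /* interior rows assigned to each process */&lt;br /&gt;
      int first = 1 + Tmk_proc_id * rows;&lt;br /&gt;
      int last  = first + rows;&lt;br /&gt;
      if (Tmk_proc_id == 0)&lt;br /&gt;
          for (int i = 0; i &amp;lt; M; i++)&lt;br /&gt;
              for (int j = 0; j &amp;lt; N; j++)&lt;br /&gt;
                  grid[i][j] = (float)(i + j);  /* arbitrary initial values              */&lt;br /&gt;
      Tmk_barrier(0);                           /* initialization done before anyone proceeds */&lt;br /&gt;
      for (int it = 0; it &amp;lt; ITERS; it++) {&lt;br /&gt;
          for (int i = first; i &amp;lt; last; i++)&lt;br /&gt;
              for (int j = 1; j &amp;lt; N - 1; j++)&lt;br /&gt;
                  scratch[i][j] = 0.25f * (grid[i-1][j] + grid[i+1][j] +&lt;br /&gt;
                                           grid[i][j-1] + grid[i][j+1]);&lt;br /&gt;
          Tmk_barrier(1);                       /* all old values read before any writes */&lt;br /&gt;
          for (int i = first; i &amp;lt; last; i++)&lt;br /&gt;
              for (int j = 1; j &amp;lt; N - 1; j++)&lt;br /&gt;
                  grid[i][j] = scratch[i][j];&lt;br /&gt;
          Tmk_barrier(2);                       /* all writes done before the next iteration */&lt;br /&gt;
      }&lt;br /&gt;
      return 0;&lt;br /&gt;
  }&lt;br /&gt;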
&lt;br /&gt;
Shown at right is a short pseudocode program exemplifying another SAS synchronization technique, the use of [http://en.wikipedia.org/wiki/Lock_%28computer_science%29 '''locks'''].  This program calculates the shortest route through a group of nodes that starts at a designated start node, visits each&lt;br /&gt;
other node once, and returns to the origin node.  The shortest route identified thus far is stored in the shared ''Shortest_length'',&lt;br /&gt;
and investigated routes are kept in a queue, most promising at the front, and expanded one node at a time.  A process&lt;br /&gt;
compares its resulting shortest partial path with ''Shortest_length'', updating it if necessary, and returns to the queue&lt;br /&gt;
to continue its search.  Process 0 allocates the shared queue and the minimum length.  Exclusive access must be established&lt;br /&gt;
and maintained to ensure correctness, and this is achieved through locks on the queue and on ''Shortest_length''.  Each&lt;br /&gt;
process acquires the queue lock to identify a promising partial path and releases it upon finding one.  When &lt;br /&gt;
updating ''Shortest_length'', a lock is acquired to ensure [http://en.wikipedia.org/wiki/Mutual_exclusion '''mutual exclusion'''] on this shared data as well.&lt;br /&gt;
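&lt;br /&gt;
A C sketch of this locking pattern follows.  The queue type and helper functions are placeholders, and the Tmk_ lock-call names follow the TreadMarks naming style but should be treated as assumptions here.&lt;br /&gt;
&lt;br /&gt;
  /* Sketch of the queue lock and the Shortest_length lock described above.        */&lt;br /&gt;
  extern void Tmk_lock_acquire(unsigned id);&lt;br /&gt;
  extern void Tmk_lock_release(unsigned id);&lt;br /&gt;
  &lt;br /&gt;
  #define QUEUE_LOCK  0&lt;br /&gt;
  #define LENGTH_LOCK 1&lt;br /&gt;
  &lt;br /&gt;
  extern int  *Shortest_length;               /* shared bound, allocated by process 0 */&lt;br /&gt;
  extern void *queue;                         /* shared queue of partial paths        */&lt;br /&gt;
  extern void *dequeue_most_promising(void *q);&lt;br /&gt;
  extern int   expand_and_measure(void *partial_path);&lt;br /&gt;
  &lt;br /&gt;
  void worker(void)&lt;br /&gt;
  {&lt;br /&gt;
      for (;;) {&lt;br /&gt;
          Tmk_lock_acquire(QUEUE_LOCK);        /* exclusive access to the shared queue */&lt;br /&gt;
          void *path = dequeue_most_promising(queue);&lt;br /&gt;
          Tmk_lock_release(QUEUE_LOCK);&lt;br /&gt;
          if (path == 0)&lt;br /&gt;
              break;                           /* no promising partial paths remain    */&lt;br /&gt;
          int len = expand_and_measure(path);  /* extend the partial path one node     */&lt;br /&gt;
          Tmk_lock_acquire(LENGTH_LOCK);       /* mutual exclusion on the shared bound */&lt;br /&gt;
          if (len &amp;lt; *Shortest_length)&lt;br /&gt;
              *Shortest_length = len;&lt;br /&gt;
          Tmk_lock_release(LENGTH_LOCK);&lt;br /&gt;
      }&lt;br /&gt;
  }&lt;br /&gt;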
&lt;br /&gt;
=== Notable DSM Implementations ===&lt;br /&gt;
From an architectural point of view, DSMs are composed of several nodes connected via a network. Each node can be an individual machine or a cluster of machines. Each node has local memory modules that are either partially or completely part of the shared memory. There are many characteristics that can be used to classify DSM implementations. One of them is the level at which the shared memory is implemented: software, hardware, or hybrid. This historical classification has been extracted from [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 Distributed shared memory: concepts and systems].&lt;br /&gt;
&lt;br /&gt;
Software DSM implementations refer to DSM implemented using user-level software, the OS, a programming language, or a combination of these. &lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Implementation'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Implementation / Cluster configuration'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Network'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Type of Algorithm'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Consistency Model'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Granularity Unit'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Coherence Policy'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''SW/HW/Hybrid'''&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/vm-and-gc/ivy-shared-virtual-memory-li-icpp-1988.pdf IVY]||User-level library + OS modification || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||1 Kbyte ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/121133.121159 Munin]||Runtime system + linker + library + preprocessor + OS modifications ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||Type-specific (SRSW, MRSW, MRMW) ||Release ||Variable size objects ||Type-specific (delayed update, invalidate) ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 TreadMarks]||User-level || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRMW ||Lazy release ||4 Kbytes ||Update, Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/74851.74871 Mirage]||OS kernel ||style=&amp;quot;padding-left: 2em&amp;quot; | - ||MRSW ||Sequential ||512 bytes ||Invalidate ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://onlinelibrary.wiley.com.prox.lib.ncsu.edu/doi/10.1002/spe.4380210503/pdf Clouds]||OS, out of kernel || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Inconsistent, sequential ||8 Kbytes ||Discard segment when unlocked ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1663305&amp;amp;tag=1 Linda]||Language || style=&amp;quot;padding-left: 2em&amp;quot; |- ||MRSW ||Sequential ||Variable (tuple size) ||Implementation- dependent ||SW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=93183 Memnet]||Single processor, Memnet device||Token ring||MRSW||Sequential||32 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Scalable_Coherent_Interface SCI]||Arbitrary||Arbitrary||MRSW||Sequential||16 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.8026&amp;amp;rep=rep1&amp;amp;type=pdf KSR1]||64-bit custom PE, I+D caches, 32M local memory||Ring-based||MRSW||Sequential||128 bytes||Invalidate||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=766965&amp;amp;isnumber=16621 RMS]||1-4 processors, caches, 256M local memory||RM bus||MRMW||Processor||4 bytes||Update||HW&lt;br /&gt;
|-&lt;br /&gt;
| [http://en.wikipedia.org/wiki/Alewife_(multiprocessor) Alewife]||Sparcle PE, 64K cache, 4M local memory, CMMU|| mesh||MRSW||Sequential||16 Kbytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://dl.acm.org/citation.cfm?id=192056 Flash]||MIPS T5, I +D  caches, Magic controller|| mesh||MRSW||Release||128 Kbytes||Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s08/papers/10_tempest.pdf Typhoon]||SuperSparc, 2-L caches|| NP controller||MRSW||Custom||32 Kbytes||Invalidate custom||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
| [http://shrimp.cs.princeton.edu/ Shrimp]||16 Pentium PC nodes|| Intel Paragon routing network||MRMW||AURC, scope||4 Kbytes||Update/Invalidate||Hybrid&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Below is an explanation of the main characteristics listed in the DSM classification.&lt;br /&gt;
 &lt;br /&gt;
There are three types of DSM algorithm: &lt;br /&gt;
* '''Single Reader/ Single Writer''' (SRSW) &lt;br /&gt;
** central server algorithm - produces long network delays &lt;br /&gt;
** migration algorithm - produces thrashing and false sharing&lt;br /&gt;
* '''Multiple Readers/ Single Writer''' (MRSW) - read replication algorithm. It uses write invalidate. MRSW is the most adopted algorithm.&lt;br /&gt;
* '''Multiple Readers/Multiple Writers''' (MRMW) - full replication algorithm.  It has full concurrency and uses atomic updates.&lt;br /&gt;
&lt;br /&gt;
The consistency model plays a fundamental role in DSM systems. Due to the distributed nature of these systems, memory accesses are constrained differently under the different consistency models. &amp;quot;A memory consistency model defines the legal ordering of memory references issued by some processor, as observed by other processors.&amp;quot; The stricter the consistency model, the higher the access times, but the simpler the programming. Some of the consistency model types are:&lt;br /&gt;
* Sequential consistency - all processors see the same ordering of memory references, and these are issued in sequences by the individual processors.&lt;br /&gt;
* Processor consistency - the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence. &lt;br /&gt;
* Weak consistency - consistency is required only on synchronization accesses.&lt;br /&gt;
* Release consistency - divides the synchronization accesses into acquire and release. Normal reads and writes for a particular node can be done only after all acquires for that node are finished. Similarly, releases can only be done after all reads and writes are finished (see the sketch after this list). &lt;br /&gt;
* Lazy release consistency - extends release consistency by propagating modifications to the shared data only on an acquire, and of those, only the ones related to critical sections.&lt;br /&gt;
* Entry consistency - synchronization is performed at the variable level. This increases programming labor but helps lower latency and traffic, as only the specific variable needs to be synchronized.&lt;br /&gt;
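&lt;br /&gt;
As an illustration (not taken from the source), the sketch below shows the ordering rule of release consistency using generic acquire and release primitives; the names are placeholders.&lt;br /&gt;
&lt;br /&gt;
  /* Generic acquire/release primitives; the names are placeholders.                */&lt;br /&gt;
  extern void acquire(int lock_id);&lt;br /&gt;
  extern void release(int lock_id);&lt;br /&gt;
  extern int  shared_x, shared_y;              /* variables in shared memory         */&lt;br /&gt;
  &lt;br /&gt;
  void producer(void)&lt;br /&gt;
  {&lt;br /&gt;
      acquire(0);       /* ordinary accesses below may begin only after the acquire completes */&lt;br /&gt;
      shared_x = 1;&lt;br /&gt;
      shared_y = 2;&lt;br /&gt;
      release(0);       /* may complete only after the writes above have completed, so the    */&lt;br /&gt;
                        /* next process to acquire lock 0 is guaranteed to observe them       */&lt;br /&gt;
  }&lt;br /&gt;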
&lt;br /&gt;
Granularity refers to the unit of data blocks that are managed by the coherence protocols. The unit differs between hardware and software systems, as hardware systems tend to use smaller blocks than the virtual layer that manages the data in software systems. The problem with larger blocks is that the probability of contention is higher, even when the processors involved are not accessing the exact same piece of memory, just different parts contained in the same block. This is known as false sharing and creates thrashing (memory blocks keep being requested by processors, and processors keep waiting for the same memory blocks).&lt;br /&gt;
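&lt;br /&gt;
As an illustration (not from the source), the following C fragment shows how false sharing arises when two processors update different variables that happen to fall in the same coherence block.&lt;br /&gt;
&lt;br /&gt;
  /* Two counters that happen to share one coherence block (e.g. one page).         */&lt;br /&gt;
  struct shared_block {&lt;br /&gt;
      int counter_for_processor_0;   /* written only by processor 0                 */&lt;br /&gt;
      int counter_for_processor_1;   /* written only by processor 1, same block     */&lt;br /&gt;
  };&lt;br /&gt;
  &lt;br /&gt;
  /* Although the processors never touch the same variable, every write by one      */&lt;br /&gt;
  /* invalidates (or forces an update of) the other's copy of the whole block,      */&lt;br /&gt;
  /* so the block ping-pongs between them and both keep waiting: thrashing.         */&lt;br /&gt;
  /* Padding each counter out to its own block-sized region avoids the problem.     */&lt;br /&gt;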
&lt;br /&gt;
Coherence policy regulates data replication. The coherence policy dictates whether data that is being written at one site should be invalidated or updated at the remote sites. Usually, systems with fine-grain coherence (byte/word) impose the update policy, whereas systems based on coarse-grain (page) coherence utilize the invalidate policy. This is also known elsewhere in the literature as the coherence protocol, and the two types of protocols are known as write-invalidate and write-update. The write-invalidate protocol invalidates all copies except one before writing to it. In contrast, write-update keeps all copies updated.&lt;br /&gt;
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
There are numerous studies of the performance of shared memory applications in distributed systems. The vast majority of them use a collection of programs named [http://dl.acm.org/citation.cfm?id=223990 SPLASH and SPLASH-2.]&lt;br /&gt;
===== SPLASH and SPLASH-2 =====&lt;br /&gt;
The '''Stanford ParalleL Applications for SHared memory''' (SPLASH) is a collection of parallel programs engineered for the evaluation of shared address space machines. These programs have been used by research studies to provide measurements and analysis of different aspects of the emerging DSM architectures of the time. A subsequent suite of programs (SPLASH-2) evolved from the need to improve on the limitations of the original SPLASH programs. SPLASH-2 covers a broader domain of scientific programs, makes use of improved algorithms, and pays more attention to the architecture of the underlying systems.&lt;br /&gt;
&lt;br /&gt;
Selected applications in the SPLASH-2 collections include:&lt;br /&gt;
*FFT: a '''Fast Fourier Transform''' implementation in which the data is organized into source and destination matrices so that each processor stores a contiguous set of rows in its local memory. In this application all processors communicate, sending data to each other, to perform a matrix transposition.&lt;br /&gt;
*Ocean: simulates large-scale ocean movements such as eddy currents. Its calculations exhibit nearest-neighbor access patterns on a multi-grid structure rather than a single grid.&lt;br /&gt;
*LU: decomposes a matrix into the product of a lower triangular and an upper triangular matrix. LU exhibits &amp;quot;one-to-many non-personalized communication&amp;quot;.&lt;br /&gt;
*Barnes: simulates the interaction of a group of particles over time steps. &lt;br /&gt;
*Radix sort: an integer sorting algorithm. This implementation involves communication among all the processors, and that communication follows irregular patterns.&lt;br /&gt;
&lt;br /&gt;
===== Case Study - 2001 - Shan et al. =====&lt;br /&gt;
In 2001, [http://escholarship.org/uc/item/76p9b40g#page-1 Shan et al.] presented a comparison of the performance and programming effort of MP versus SAS running on clusters of '''symmetric multiprocessors''' (SMPs). They highlighted the &amp;quot;automatic management and coherent replication&amp;quot; of the SAS programming model, which facilitates the programming tasks in these types of clusters. The study used the MPI/Pro protocol for the MP programming model and the GeNIMA SVM protocol (a page-based shared virtual memory protocol) for SAS on a 32-processor system (a cluster of 8 machines, each a 4-way SMP). The applications used include regularly structured ones such as FFT, Ocean, and LU, contrasted with irregular ones such as Radix sort, among others.&lt;br /&gt;
&lt;br /&gt;
The complexity of development is represented by the number of code lines per application, as shown in the table below. It is observed that SAS complexity is significantly lower than that of MP, and this difference increases as applications become more irregular and dynamic in nature (the MP line count is almost double for Radix).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Appl.&lt;br /&gt;
!FFT &lt;br /&gt;
!OCEAN &lt;br /&gt;
!LU &lt;br /&gt;
!RADIX &lt;br /&gt;
!SAMPLE &lt;br /&gt;
!N-BODY&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MPI ||222||4320||470||384||479||1371&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SAS ||210 ||2878 ||309||201||450||950&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Performance-wise, the results indicated that SAS achieved only about half the parallel efficiency of MP for most of the applications in this study. The only application that showed similar performance for both methods (MP and SAS) was LU. The authors concluded that the overhead of the SVM protocol for maintaining page coherence and synchronization was the handicap of the easier SAS programming model.&lt;br /&gt;
&lt;br /&gt;
===== Case Study - 2004 - Iosevich and Schuster =====&lt;br /&gt;
&lt;br /&gt;
In 2004, [http://dl.acm.org/citation.cfm?id=1006252 Iosevich and Schuster] performed a study of two memory consistency models in a DSM: the '''sequential consistency''' (SC) model and a relaxed consistency model called the '''home-based lazy release consistency''' (HLRC) protocol. SC provides a less complex programming model, whereas HLRC improves running performance by allowing parallel memory access. Memory consistency models provide a specific set of rules for interfacing with memory.&lt;br /&gt;
&lt;br /&gt;
The authors used a [http://www.cs.uga.edu/~dkl/6730/Fall02/Readings/millipage.pdf multiview] (MV) technique to ensure an efficient implementation of SC with fine-grain access to memory. The main advantage of this technique is that, by mapping one physical region to several virtual regions, the system avoids fetching the whole content of the physical region when there is a fault accessing a specific variable located in one of the virtual regions. One step further is the mechanism proposed by [http://onlinelibrary.wiley.com/doi/10.1002/spe.417/abstract Niv and Schuster] to dynamically change the granularity at runtime.&lt;br /&gt;
For SC (with MV), only the page replicas need to be tracked, whereas for HLRC the tracking needed is much more complex. The advantage of HLRC is that write faults on read-only pages are local, resulting in a lower cost for these operations.&lt;br /&gt;
&lt;br /&gt;
This table summarizes the characteristics of the benchmark applications used for the measurements. In the Synch column, B represents barriers and  L locks.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Application&lt;br /&gt;
! Input data set&lt;br /&gt;
! Shared memory&lt;br /&gt;
!Sharing granularity&lt;br /&gt;
! Synch&lt;br /&gt;
!Allocation pattern&lt;br /&gt;
|-&lt;br /&gt;
| Water-nsq|| 8000 molecules|| 5.35MB|| a molecule (672B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Water-sp|| 8000 molecules|| 10.15MB|| a molecule (680B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| LU|| 3072 × 3072|| 72.10MB|| block (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| FFT|| 2^20 numbers|| 48.25MB|| a row segment|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| TSP|| A graph of 32 cities|| 27.86MB|| a tour (276B)|| L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| SOR|| 2066 × 10240|| 80.73MB|| a row (coarse)|| B|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Barnes-sp|| 32768 bodies|| 41.21MB|| body fields (4-32B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Radix|| 10240000 keys|| 82.73MB|| an integer (4B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| Volrend|| a file -head.den-|| 29.34MB|| a 4 × 4 box (4B)|| B, L|| fine&lt;br /&gt;
|-&lt;br /&gt;
| Ocean|| a 514 × 514 grid|| 94.75MB|| grid point (8B)|| B, L|| coarse&lt;br /&gt;
|-&lt;br /&gt;
| NBody|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|-&lt;br /&gt;
| NBodyW|| 32768 bodies|| 2.00MB|| a body (64B)|| B|| fine&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The authors found that the average speedup of the HLRC protocol is 5.97, and the average speedup of the SC (with MV) protocol is 4.5 if non-optimized allocation is used, and 6.2 if the granularity is changed dynamically.&lt;br /&gt;
&lt;br /&gt;
===== Case Study - 2008 - Roy and Chaudhary =====&lt;br /&gt;
&lt;br /&gt;
In 2008, [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.57.1319 Roy and Chaudhary] compared the communication requirements of three different page-based DSM systems ([http://www.cs.umd.edu/projects/cvm/ CVM], [http://www.cs.utah.edu/flux/quarks.html Quarks], and [http://ieeexplore.ieee.org/iel4/5737/15339/00709960.pdf?arnumber=709960 Strings]) that use virtual memory mechanisms to trap accesses to shared areas. Their study was also based on the SPLASH-2 suite of programs.&lt;br /&gt;
In their experiments, Quarks was running tasks on separate processes (it does not support more than one application thread per process) while CVM and Strings were run with multiple application threads.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Program &lt;br /&gt;
!CVM &lt;br /&gt;
!Quarks &lt;br /&gt;
!Strings&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| FFT||1290||2419||1894&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-c||135||-||485&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| LU-n||385||2873||407&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| OCEAN-c||1955||15475||6676&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-n2||2253||38438||10032&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| WATER-sp||905||7568||1998&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| MATMULT||290||1307||645&lt;br /&gt;
|- style=&amp;quot;text-align: center;&amp;quot;&lt;br /&gt;
| SOR||247||7236||934&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
The results indicate that Quarks generates a large quantity of messages in comparison with the other two systems. This can be observed in the table above. LU-c (the contiguous version of LU) does not have a result for Quarks, as it was not possible to obtain results for the number of tasks (16) used in the experiment due to the excessive number of collisions. In general, due to lock-related traffic, the performance of Quarks is quite low compared to the other two systems for many of the application programs tested.&lt;br /&gt;
CVM improves on lock management by allowing out-of-order access through a centralized lock manager. This makes the locking times for CVM much smaller than those of the others. &lt;br /&gt;
Nevertheless, the best performer is Strings. It wins in overall performance over the other two systems compared, but there is room for improvement here too, as the locking times in Strings were observed to be a large percentage of the overall computation time. &lt;br /&gt;
&lt;br /&gt;
The overall conclusion is that using multiple threads per node and optimized locking mechanisms (CVM-type) provides the best performance.&lt;br /&gt;
&lt;br /&gt;
=== Evolution ===&lt;br /&gt;
A more recent version of a distributed shared memory system is vNUMA.  [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html Chapman and Heiser] describe vNUMA (where v is for virtual and NUMA is for [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non-Uniform Memory Access]) as &amp;quot;a virtual machine that presents a cluster as a virtual shared-memory multiprocessor.&amp;quot; The virtualization in vNUMA is a layer between the real CPUs that form the distributed shared memory system and the OS that runs on top of it, usually Linux. The DSM in vNUMA is part of the hypervisor, the layer of the virtual system that maps guest addresses onto real physical memory; the guest OS, in turn, maps the virtual memory addresses seen by applications onto those guest addresses. &lt;br /&gt;
&lt;br /&gt;
The difference from other virtual machines is that vNUMA runs one OS on top of several physical machines, whereas virtual machines often run several guest OSes on top of one host OS running on a single machine. vNUMA uses software DSM techniques to present all of the virtualized memory as a whole.&lt;br /&gt;
&lt;br /&gt;
The DSM in vNUMA is a single-writer/multiple-reader write-invalidate protocol with sequential consistency. It is based on the IVY DSM, but it introduces several improvements to increase performance. The owner of a page can determine whether the page needs to be sent by looking at the copyset (the set of nodes that maintain a copy of the page), avoiding several page faults, and the manager becomes the owner of the copyset as soon as it is part of it. There are a couple of other improvements: ''incremental deterministic merging'' and ''write-update-plus (WU+)''. ''Incremental deterministic merging'' uses sequence numbers to ensure that a location gets updated with the latest value rather than with intermediate, out-of-order writes. ''Write-update-plus (WU+)'' enforces single-writer behavior for pages on which atomic operations are performed; vNUMA dynamically changes from multiple-writer to single-writer when atomic operations are detected. &lt;br /&gt;
&lt;br /&gt;
In [http://communities.vmware.com/community/vmtn/cto/high-performance/blog/2011/09/19/vnuma-what-it-is-and-why-it-matters vNUMA what it is and why it matters], VMware presents vNUMA as part of vSphere, a virtualization platform oriented to build cloud computing frameworks.&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
*[http://www.sgi.com/pdfs/4250.pdf Performance and Productivity Breakthroughs with Very Large Coherent Shared Memory: The SGI® UV Architecture] SGI white paper.&lt;br /&gt;
*[http://www.scalemp.com/architecture#2 Versatile SMP (vSMP) Architecture]&lt;br /&gt;
*Adrian Moga, Michel Dubois, [http://www.sciencedirect.com/science/article/pii/S1383762108001136 &amp;quot;A comparative evaluation of hybrid distributed shared-memory systems,&amp;quot;] Journal of Systems Architecture, Volume 55, Issue 1, January 2009, Pages 43-52&lt;br /&gt;
*Jinbing Peng; Xiang Long; Limin Xiao; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=4708970&amp;amp;isnumber=4708921 &amp;quot;DVMM: A Distributed VMM for Supporting Single System Image on Clusters,&amp;quot;] Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for , vol., no., pp.183-188, 18-21 Nov. 2008&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
*Shan, H.; Singh, J.P.; Oliker, L.; Biswas, R.; , [http://escholarship.org/uc/item/76p9b40g#page-1 &amp;quot;Message passing vs. shared address space on a cluster of SMPs,&amp;quot;] Parallel and Distributed Processing Symposium., Proceedings 15th International , vol., no., pp.8 pp., Apr 2001&lt;br /&gt;
*Protic, J.; Tomasevic, M.; Milutinovic, V.; , [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;amp;arnumber=494605&amp;amp;isnumber=10721 &amp;quot;Distributed shared memory: concepts and systems,&amp;quot;] Parallel &amp;amp; Distributed Technology: Systems &amp;amp; Applications, IEEE , vol.4, no.2, pp.63-71, Summer 1996&lt;br /&gt;
*Chandola, V. , [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.5206&amp;amp;rep=rep1&amp;amp;type=pdf &amp;quot;Design Issues in Implementation of Distributed Shared Memory in User Space,&amp;quot;]&lt;br /&gt;
*Nitzberg, B.; Lo, V. , [http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf &amp;quot;Distributed Shared Memory:  A Survey of Issues and Algorithms&amp;quot;]&lt;br /&gt;
*Steven Cameron Woo , Moriyoshi Ohara , Evan Torrie , Jaswinder Pal Singh , Anoop Gupta, [http://dl.acm.org/citation.cfm?id=223990 &amp;quot;The SPLASH-2 programs: characterization and methodological considerations,&amp;quot;] Proceedings of the 22nd annual international symposium on Computer architecture, p.24-36, June 22-24, 1995, S. Margherita Ligure, Italy&lt;br /&gt;
*Jegou, Y. , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=1199404 Implementation of page management in Mome, a user-level DSM] Cluster Computing and the Grid, 2003. Proceedings. CCGrid 2003. 3rd IEEE/ACM International Symposium on, p.479-486, 21 May 2003, IRISA/INRIA, France&lt;br /&gt;
*Hennessy, J.; Heinrich, M.; Gupta, A.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=747863 &amp;quot;Cache-Coherent Distributed Shared Memory:  Perspectives on Its Development and Future Challenges,&amp;quot;] Proceedings of the IEEE, Volume: 87 Issue:3, pp.418 - 429, Mar 1999, Comput. Syst. Lab., Stanford Univ., CA&lt;br /&gt;
*J. Protic, M. Tomasevic, and V. Milutinovic, [http://media.wiley.com/product_data/excerpt/76/08186773/0818677376-2.pdf An Overview of Distributed Shared Memory]&lt;br /&gt;
*Yoon, M.; Malek, M.; , [http://www.cs.utexas.edu/ftp/techreports/tr94-21.pdf &amp;quot;Configurable Shared Virtual Memory for Parallel Computing&amp;quot;] University of Texas Technical Report tr94-21, July 15 1994, Department of Electrical and Computer Engineering, The University of Texas at Austin&lt;br /&gt;
*Dubnicki, C.;   Iftode, L.;   Felten, E.W.;   Kai Li;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=508084 &amp;quot;Software Support for Virtual Memory-Mapped Communication&amp;quot;] Parallel Processing Symposium, 1996., Proceedings of IPPS '96, The 10th International, pp.372 - 381, 15-19 Apr 1996, Dept. of Comput. Sci., Princeton Univ., NJ &lt;br /&gt;
*Dubnicki, C.;   Bilas, A.;   Li, K.;   Philbin, J.;  , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=580931 &amp;quot;Design and Implementation of Virtual Memory-Mapped Communication on Myrinet&amp;quot;] Parallel Processing Symposium, 1997. Proceedings., 11th International, pp.388 - 396, 1-5 Apr 1997, Princeton Univ., NJ&lt;br /&gt;
*Kranz, D.; Johnson, K.; Agarwal, A.; Kubiatowicz, J.; Lim, B.; , [http://delivery.acm.org.prox.lib.ncsu.edu/10.1145/160000/155338/p54-kranz.pdf?ip=152.1.24.251&amp;amp;acc=ACTIVE%20SERVICE&amp;amp;CFID=81831968&amp;amp;CFTOKEN=62928147&amp;amp;__acm__=1327853249_029db7e958cb50bd47056f939f3296f7 &amp;quot;Integrating Message-Passing and Shared-Memory:  Early Experience&amp;quot;] PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp.54 - 63&lt;br /&gt;
*Amza, C.;  Cox, A.L.;  Dwarkadas, S.;  Keleher, P.;  Honghui Lu;  Rajamony, R.;  Weimin Yu;  Zwaenepoel, W.; , [http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=485843&amp;amp;tag=1 &amp;quot;TreadMarks: shared memory computing on networks of workstations&amp;quot;] Computer, Volume: 29 Issue:2, pp. 18 - 28, Feb 1996, Dept. of Comput. Sci., Rice Univ., Houston, TX&lt;br /&gt;
*David R. Cheriton. 1985. [http://doi.acm.org.prox.lib.ncsu.edu/10.1145/858336.858338 Preliminary thoughts on problem-oriented shared memory: a decentralized approach to distributed systems.] SIGOPS Oper. Syst. Rev. 19, 4 (October 1985), 26-33.&lt;br /&gt;
*Matthew Chapman and Gernot Heiser. 2009. [http://www.usenix.org/event/usenix09/tech/full_papers/chapman/chapman_html/index.html vNUMA: a virtual shared-memory multiprocessor.] In Proceedings of the 2009 conference on USENIX Annual technical conference (USENIX'09). USENIX Association, Berkeley, CA, USA, 2-2.&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
The memory hierarchy described for the CSVS system places remote memories:&lt;br /&gt;
# Between main memory and local disk storage&lt;br /&gt;
# Same hierarchy as local disk storage&lt;br /&gt;
# Below local disk storage&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When messages are sent by the OS to retrieve non-local data the virtual address of the retrieved data is translated to physical:&lt;br /&gt;
# At the origin of the message, i.e. where the page fault occurs&lt;br /&gt;
# By the DSM system default manager&lt;br /&gt;
# At the location where the desired page resides&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
DSM nodes&lt;br /&gt;
# partially map variable amounts of their memory to the distributed address space&lt;br /&gt;
# are configured to supply a contiguous and fixed amount of memory to the distributed address space&lt;br /&gt;
# utilize I/O to access the entirely non-local distributed address space&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SAS programming model:&lt;br /&gt;
# Has evolved beyond MP as it is difficult to program in scalable DSM environments&lt;br /&gt;
# Utilizes MP to communicate but relies on the ease of a common address space&lt;br /&gt;
# Has suffered too many security problems, scalable MP now dominates the landscape&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Page management in MOME:&lt;br /&gt;
# Requires consistent address space mapping across all nodes&lt;br /&gt;
# Is managed from a global DSM perspective&lt;br /&gt;
# Allows an F and V page descriptor to occur for the same page on the same node&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The most adopted DSM algorithm is:&lt;br /&gt;
# Single Reader/ Single Writer (SRSW)&lt;br /&gt;
# Multiple Readers/ Single Writer (MRSW)&lt;br /&gt;
# Multiple Readers/Multiple Writers (MRMW)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Sequential Consistency:&lt;br /&gt;
# all processors see the same ordering of memory references, and these are issued in sequences by the individual processors&lt;br /&gt;
# the order of writes is observed by every individual processor in the system, but the order of reads does not need to be in sequence&lt;br /&gt;
# consistency is required only on synchronization accesses&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
SPLASH is a&lt;br /&gt;
# coherence protocol&lt;br /&gt;
# collection of parallel programs engineered for the evaluation of shared address space machines&lt;br /&gt;
# DSM implementation&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The complexity of development is &lt;br /&gt;
# the same for MP and SAS&lt;br /&gt;
# lower for SAS&lt;br /&gt;
# lower for MP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
vNUMA is a&lt;br /&gt;
# Fast Fourier Transform implementation&lt;br /&gt;
# network implementation&lt;br /&gt;
# virtual machine that presents a cluster as a virtual shared-memory multiprocessor&lt;/div&gt;</summary>
		<author><name>Mchen4</name></author>
	</entry>
</feed>