ECE506 CSC/ECE 506 Spring 2013/11a ad: Difference between revisions

From Expertiza_Wiki
Jump to navigation Jump to search
 
(48 intermediate revisions by the same user not shown)
Line 1: Line 1:
NOTE: This is a modified wiki.All references from the Sohlin's book and class notes have been avoided.This is a standalone resource which condenses data from research papers,hence requires previous knowledge of DSM alongwith some idea of performance issues faced by designers.
=Introduction=
=Introduction=
When dealing with a relatively small number of processors (8-16), according to Solihin 320, using a bus based shared memory structure is fine.  Unfortunately, when you need to provide a shared memory structure for processors much greater than that, you will need a different set of organization. This new organization is needed due to the physical limitations of the bus.  There are two ways you can create such a system. These include [http://en.wikipedia.org/wiki/Distributed_shared_memory Distributed Shared Memory] (DSM) or [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non-Uniform Memory Access] (NUMA).  The benefits of having a DSM and NUMA is that we can now scale to a larger amount of processors.  The disadvantage is that scaling in such a way may not be the most cost-effective solution, Solihin 320.  For the remainder of this section, we will be discussing the performance of DSM's.
Distributed shared memory received much attention because it offers the power of
parallel computing using multiple processors as well as a single system memory view which
makes the programming task easy.
 
We start by classifying DSM's on basis of implementations and characteristics.Software and Hardware DSM's face different performance issues hence we need to figure out the limitations of the system.
Memory Consistency,Coherence in a distributed shared memory system are important issues because there might
be some potential problems when different processors and caches are used to update
shared single memory space. In order to improve performance and get correct result of computation, distributed shared memory systems designers should choose the proper paradigm of memory coherence semantics ,consistency protocols .<ref>http://crystal.uta.edu/~kumar/cse6306/papers/Chingwen.pdf</ref>


Scalability,Granularity come next as these systems consist of a collection of independent computers connected by
a high-speed interconnection network. If designers choose the network topology
carefully, the system can scale to a large number of nodes.


According to Solihin 320, there are two aspects that restrict the scalability of bus-based multiprocessors.  These include the physical limitations of interconnections and the limitations of the protocol.  To explain in detail, on a bus-based system, adding a processor will not affect any other physical restrictions on the system.  Unfortunately, when adding a new processor, you will be reducing the speed of the bus.  Second, the protocol needed to keep coherence does not scale well.  As you increase the number of processors to the system, the amount of traffic also increases.  This means that you might run the risk of overwhelming the bandwith.  According to Solihin, there are a few ways that we can mitigate this problem.  The following is from 321 of the Solihin textbook.


DSM’s are also concerned with the interconnection network that provide the data to the requesting processor in an
efficient and timely fashion. Both the bandwidth (amount of data that can be supplied in a
unit time) and latency (the time it takes to receive the first piece of requested data from
the time the request isissued) are important to the design of DSM’s. Precisely because of
the generally longer latencies encountered in large scale DSM’s, multithreading has
received considerable attention; multithreading model can be utilized to tolerate (or
mask) memory latencies.
<ref>http://web.engr.oregonstate.edu/~benl/Publications/Book_Chapters/Advances_in_Computers_DSM00.pdf</ref>


<center>[[Image:ProtocolVSinterconnection.png|400px| Multiple Caches of Shared Resource]]</center>
After looking into SMP model parameters which are intrinsic to DSM architechture we move on to external factors such as API and Memory organization.Major improvements can be achieved if the above parameters are taken into account when designing a Distributed Memory Architecture system.
<center> '''Figure 1. Ways to Scale Multiprocessors''' </center>


Implementation of DSM is taken up next and explained in detail alongwith benchmarks.This helps in understanding the benefits of keeping the above discussed parameters in mind when tweaking the designs based on available information.Future work on this can include interrelation among various parameters and the need to come up with a better technique to figure out performance meterics for a DSM based system.


From the table, we can see that there is three ways to scale a multiprocessor system.  The first being a single bus system.  This is the least scalable due to the limitations of the bus wire itself.  As you add processors you will decrease the bus speed due to having to increase the wire length.  Also, you run into an issue of overwhelming the bus due to the amount of traffic.  The second way is to use a point-to-point bus system.  This allows for the speed of the bus to remain relatively fast, but since the traffic will also scale with the number of processors, there will be a limitation due to overwhelming the bus system with traffic.  Lastly, the most scalable system to date is using a directory system.  This allows for the bus to remain fast due to the short wires, and the bus traffic to remain low since the directory holds information on cache locations.


<center>[[Image:Dsm1.jpg]]</center>
<center> '''Figure.  DSM Designs'''</center>


=DSM Classification=
=DSM Classification<ref>http://www.cs.rit.edu/~pns6910/docs/Distributed%20Shared%20Memory%20Systems/A%20survey%20of%20distributed%20shared%20memory%20systems.pdf</ref>=
To provide an overview  
<blockquote>To provide an overview  
of DSM, all possible platforms must be considered in DSM design.The choice  
of DSM, all possible platforms must be considered in DSM design.The choice  
relies on classifying all existing systems  
relies on classifying all existing systems  
into appropriate non-overlapping subsets of systems.
into appropriate non-overlapping subsets of systems.
DSM implementation level Types-
DSM implementation level types-</blockquote>
  1 Hardware  
  1 Hardware  
  2 Software  
  2 Software  
Line 41: Line 61:
DSM mechanism is usually organized as a set of clusters  
DSM mechanism is usually organized as a set of clusters  
connected by an interconnection network, architectural parameters include:</blockquote>
connected by an interconnection network, architectural parameters include:</blockquote>
  <blockquote>a) Cluster configuration (single/multiple processors, with/without, shared/private, single/multiple level caches, etc.)
  a) Cluster configuration (single/multiple processors, with/without, shared/private, single/multiple level caches, etc.)
            b) Interconnection network (bus hierarchy, ring, mesh, hyper- cube, specific LAN, etc.) </blockquote>
b) Interconnection network (bus hierarchy, ring, mesh, hyper- cube, specific LAN, etc.)


<blockquote>Cluster configuration is usually very important for the hardware-oriented proposals that integrate the mechanisms of cache  
<blockquote>Cluster configuration is usually very important for the hardware-oriented proposals that integrate the mechanisms of cache  
Line 59: Line 79:
DSM algorithm, since it affects the possibility and cost of broad-  
DSM algorithm, since it affects the possibility and cost of broad-  
cast and multicast transactions. </blockquote>
cast and multicast transactions. </blockquote>
<blockquote>Shared data organization represents the global layout of
 
shared address space, as well as the size and organization of data
<blockquote>The impact of organization to the overall system performance is closely related to the locality of data access.  
items in it, and can be distinguished as:</blockquote> <blockquote>
Hardware solutions always deal with non-structured data objects  
a) Structure of shared data (non structured or structured into
objects, language types. etc.) </blockquote><blockquote>
b) Granularity of coherence unit (word, cache block, page,
complex data structure, etc.) </blockquote>
<blockquote>
The impact of this organization to the overall system performance is closely related to the locality of data access.  
Hardware solutions always &al with non-structured data objects  
(typically cache blocks), while many software implementations  
(typically cache blocks), while many software implementations  
tend to use data items that represent logical entities, in order to  
tend to use data items that represent logical entities, in order to  
Line 76: Line 89:
blocks (pages), counting on the coarse-grain sharing.</blockquote>
blocks (pages), counting on the coarse-grain sharing.</blockquote>


==Software Support==
==Software DSM<ref>http://csag.ucsd.edu/individual/yskee/publication/parco04.pdf</ref>==
In 1986, the first software supported DSM was created. Since then, it has been well over 20 years and there have been great improvement upon the first initial system. First, it is usually the case that the software support will find some way to relax the memory consistency model. This is due to the fact that memory passing on a DSM is much more expensive than message passing on a single bus system. Over the last 20 years, over 20 different memory consistency models have been proposed <ref name="shi"></ref>. Second, cache coherence must be addressed. Having multiple cache copies means that when one copy is updated the other cache copies should be affected in some way such that the old values are not used. Traditionally there are two techniques, one being snoopy protocol and the second being directory based protocol. According to Shi <ref name="shi"></ref>, snoopy protocol is less used due the fact that it requires hardware support. Lastly, according to Shi <ref name="shi"></ref> the major problem is the interfaceFor a DSM system to be competitive, it has to be able to work for many customers.  Below is a listing of some representative software DSM implementations.
<blockquote>Software distributed shared memory (SDSM) systems which provide shared address space have been of great importance to distributed memory architectures.
Early SDSM systems like IVY [1], Midway [2], Munin [3], and TreadMarks [4]
assume uniprocessor nodes, thus allow only one thread per process on a node.
Currently, commodity off-the-shelf microprocessors and network components are
widely used as building blocks for parallel computers. This trend has made cluster
systems consisting of symmetric multiprocessors (SMPs) attractive platforms for
high performance computing. However, the early single-threaded SDSM systems
are too restricted to exploit multiprocessors in SMP clusters. The next generation
SDSM systems like Quarks [5], Brazos [6], DSM-Threads [7], and Murks [8] are
aware of multiprocessors and exploit them by means of multiprocesses or multithreads. In general, naive process-based systems experience high context switchingoverhead and additional inter-process communication delay within a node, so the focus is on multi-threaded SDSM systems.
Many single-threaded SDSM systems are implemented at user-level by using the
page fault handling mechanisms.</blockquote>
<br>
 
<blockquote>The SDSM system faces a dilemma when multiple threads compete to access an invalid
page within a short interval. On the first access to an invalid page, the system should
set the page writable to replace with a valid one. Unfortunately, this change also allows other application threads to access the same page freely. This phenomenon is known as atomic page update and change right problem [7] or a race condition [8].
A known solution to this problem adopted by major multithreaded SDSM systems
like TreadMarks [9], Brazos [6], and Strings [10] is to map a file to two different virtual addresses. Even though the file mapping method achieves good performance
on some systems, file mapping is not always the best solution. Operating system
and working environment may severely affect the performance of these systems.
The file mapping method performs poorly in some cases; for example, an IBM SP
Night Hawk system
.Moreover, file
mapping has high initialization cost and reduces the available address space because SDSM and application partition the address space.
We note the cause of the atomic page update problem is that SDSM and application
share the same address space. When SDSM changes a page writable, the page is
also accessible to the application. A general solution to this problem is to separate
the application address space from the system address space for the same physical
memory, and to assign different access permission to each address space.
</blockquote>
 
==Hardware Support<ref>http://www.csl.cornell.edu/~espeight/papers/CSL-TR-2000-1008.pdf</ref>==
 
<blockquote>
Scalable cache-coherent distributed shared-memory (DSM) machines have received much attention in the literature since the late 1980s. To demonstrate their effectiveness, several cache-coherent non-uniform memory access (CC-NUMA) hardware DSM machines were built in the research com- munity (e.g. DASH [26], Alewife [2],) and commercial machines followed
(e.g. SGI Origin 2000 [23], Sequent NUMA-Q [29], HP Exemplar [1],).
  At the same time, a large research effort produced a set of scientific benchmarks with which to evaluate DSM machines [48]. Most high-performance hardware DSM machines have tightly-integrated node or memory controllers that connect the microprocessor both to the memory system and to a proprietary high-speed
switching network. The scalable coherence protocols used in such machines are implemented either in hardware finite-state machines or in software running on an embedded
programmable device in the controller. Despite the resulting high performance of these systems, and efforts to show that the necessary additional hardware to support hardware DSM in commodity workstations and servers is small [25], high-end PC servers and engineering workstations have yet to integrate the additional functionality needed to build seamless hardware DSM from COTS
(commodity of-the-shelf) components.</blockquote>
 
= Performance Concerns =


A DSM machine has unique requirements compared to shared memory / bus based machines in order to provide cache coherence and memory consistency, as well as having interconnects. Performance concerns of each of these concepts in DSM machines are discussed in detail.
== Cache coherence ==
<blockquote>The presence of multiple cached copies of a page requires a mechanism to notify other sharers
of a modified memory location. A new value is propagated by either invalidating or updating
each copy, which correspond to write-invalidate and write-update coherence protocol respectively. Generally, the coherence protocol consists of the data structures and algorithms used
to implement a given memory consistency model.</blockquote>
<blockquote>
A cache coherence protocol is in fact a mechanism of propagating newly written values so
that all processors have a coherent view of the shared memory. It is normally designed under
the requirement of a memory consistency model which specifies the coherence requirement on
a cache coherence protocol, i.e., what “coherent view” of shared memory a cache coherence
protocol should provide. Many new ideas accompany with relaxed memory consistency models
have been proposed to efficiently propagate newly written values. Examples of these new ideas
include lazy release consistency, and delay consistency etc.</blockquote>
<blockquote>
Traditionally, there are two main methods to maintain coherence: snoopy protocol and
directory-based protocol. The snoopy coherence protocol requires the support of hardware so
it is not widely used in software DSM systems. Up to now, almost all software DSM systems
adopt the directory scheme or that similar to directory scheme. However, the scalability of
the directory-based scheme is limited by the organization mode of the directory. Therefore, we
propose a new scheme to maintain the coherence between multiple cache copies.</blockquote>


<center>[[Image:SoftwareDSM.jpg|600px| Software DSM]]</center>
== Memory consistency <ref>http://www.cs.wayne.edu/~weisong/papers/shi00-jiajia.pdf</ref>==
<center> '''Figure 3Representative Software DSM Implementations''' <ref name="shi"></ref></center>
<blockquote>The Memory consistency model is an interface between the programmer and the system.
The memory consistency model of a shared memory system formally specifies how the memory
system will appear to the programmer. Normally, a memory consistency model defines constraints on the order in which memory accesses can be performed in shared memory systems.
It influences both programming complexity and performance. The stricter the memory consistency model, the easier for programmer to program, and the less opportunity for optimization.
A strict consistency model like sequential consistency is intuitive to the programmer. However, with the large granularity of coherence unit (a page) in shared virtual memory systems,
the false sharing problem will be so serious that the performance of software DSM systems
under sequential consistency is very poor. For example, the performance of the first software
DSM system IVY is so poor that the main contribution of it is the original idea about software
DSM, while the practical system is useless.
To improve performance, software DSM systems normally adopt relaxed memory consistency models which separate synchronization operations from ordinary load and store operations
and allow ordinary operations to be executed out of order. The propagation and application of
coherence operations are postponed until synchronization points. In the past decade, almost
20 different memory consistency models have been proposed for hardware-coherent systems.
Among those relaxed consistency models, release consistency which separates acquire
from release synchronization inspire a major breakthrough in the performance of software DSM
systems.</blockquote>
<blockquote>
Although the memory consistency model specifies when coherence operations and data
need to become visible, it can actually be implemented with various degrees of “laziness” in
the propagation and application of both coherence and data. Greater laziness implies greater
complexity and protocol and state, but fewer communication and protocol operations. For
example, hardware-coherent systems that use release consistency tend to propagate coherence
and apply them immediately, thus simplifying the data structures that need to memorize the
state of the cache line.</blockquote>
<blockquote>
In software DSMs, it is very important to reduce the number of messages exchanged, because
sending a message in a software DSM is more expensive than that in a hardware DSM.</blockquote>
<blockquote>
The TreadMarks’ lazy implementation of release consistency goes further. It does not
propagate the modifications in a critical section at the time of release. Instead, modifications
are buffered and are propagated merely to the processor that acquires the released lock until
the time of acquire. In this way, lazy release consistency reduces both the number of messages
and the amount of data exchanged. In LRC, before a processor can pass an acquire operation,
all modifications that have been visible to the releasing processor must also be visible to the
acquiring processor.</blockquote>




==Hardware Support==
== Interconnection Latency ==


<blockquote>A mentioned in the cache coherency and memory consistency sections, interconnections unique distinguish DSM systems from bus based systems.  Interconnections are unlike a bus in that they do not guarantee that messages reach recipients, and certainly aren't seen by the receivers at the same moment. Each message must be sent as a transaction in a networking protocol, and each packet sent has at least the latency of hops through routers in the network, and can also incur latency in being generated.</blockquote>
<blockquote>
Since messages can become ubiquitous if the DSM system was naively designed to perform identically to a bus based system, care must be taken to design coherence protocols and consistency models that minimize the sending and receiving of messages, and if they must be sent or received, allow for overlapping of execution without blocking while waiting for messages or their receipt.</blockquote>


Although a lot of research has been towards software support for DSM, there has been some research in adding some hardware support. Unfortunately, according to Shi <ref name="shi"></ref> there is a rejection of hardware support from large corporations. What will occur when using hardware support is a issue of compatibility. Fortunately, recent adoptions of certain hardware standards will allow for some hardware support on the mass level.
==Granularity<ref>http://www.cdk5.net/dsm/Ed4/Chapter%2018%20DSM.pdf</ref>==
<blockquote>An issue that is related to the structure of DSM is the granularity of sharing.
Conceptually, all processes share the entire contents of a DSM. As programs sharing
DSM execute, however, only certain parts of the data are actually shared and then only
for certain times during the execution. It would clearly be very wasteful for the DSM
implementation always to transmit the entire contents of DSM as processes access and
update it. What should be the unit of sharing in a DSM implementation? That is, when
a process has written to DSM, which data does the DSM runtime send in order to provide
consistent values elsewhere?</blockquote>
<blockquote>The focus here is on page-based implementations, although the granularity issue
does arise in other implementations. In a page-based DSM, the
hardware supports alterations to an address space efficiently in units of pages –
essentially by the placement of a new page frame pointer in the page table (see, for
example, Bacon [2002] for a description of paging). Page sizes can typically range up
to 8 kilobytes, so this is an appreciable amount of data that must be transmitted over a
network to keep remote copies consistent when an update occurs. By default, the price
of the whole page transfer must be paid whether the entire page has been updated, or just
one byte of it.</blockquote>
<blockquote>Using a smaller page size does not necessarily lead
to an improvement in overall performance. First, in cases where processes do update
large amounts of contiguous data, it is better to send one large page rather than several
smaller pages in separate updates, because of the fixed software overheads per network
packet. Second, using a small page as the unit of distribution leads to a large number of
units that must be administered separately by the DSM implementation.
To complicate matters further, processes tend to contend more for pages when the
page size is large, because the likelihood that the data they access will lie within the
same page increases with the page size. Consider, for example, two processes, one of
which accesses only data item A while the other accesses only data item B, which lie
within the same page. For the sake of concreteness, let us assume that one
process reads A and the other updates B. There is no contention at the application level.
However, the entire page must be transmitted between the processes, since the DSM
runtime does not by default know which locations in the page have been altered. This
phenomenon is known as false sharing: two or more processes share parts of a page, but
only one in fact accesses each part. In write-invalidate protocols, false sharing can lead
to unnecessary invalidations. In write-update protocols, when several writers falsely
share data items they may cause them to be overwritten with older versions.
In practice, the choice of the unit of sharing has to be made based on the physical
page sizes available, although a unit of several contiguous pages may be taken if the
page size is small. The layout of data with respect to page boundaries is an important
factor in determining the number of page transfers made when a program executes.
</blockquote>


<br>
==Thrashing==
<blockquote>
A potential problem with write-invalidate protocols is thrashing. Thrashing is said to
occur where the DSM runtime spends an inordinate amount of time invalidating and
transferring shared data compared with the time spent by application processes doing
useful work. It occurs when several processes compete for the same data item, or for
falsely shared data items. If, for example, one process repeatedly reads a data item that
another is regularly updating, then this item will be constantly transferred from the
writer and invalidated at the reader. This is an example of a sharing pattern for which
write-invalidate is inappropriate and write-update would be better. The next section
describes the Mirage approach to thrashing, in which computers ‘own’ pages for a
minimum period; Section 18.4 describes how Munin allows the programmer to declare
access patterns to the DSM system so that it can choose appropriate update options for
each data item and avoid thrashing.
</blockquote>


11a.  Performance of DSM systems.  Distributed shared memory systems combine the programming models of shared memory systems, and the scalability of distributed systems.  However, since DSM systems need extra coordination between software layer and underlying hardware, achieving good performance could be a big challenge. The factors that harm the performance could be the overhead to maintain cache coherence, memory consistency, and the latency of interconnections. Please further explore the factors that can affect the performance of DSM systems, and the improvements that have been made on the existing systems.
==Scalabilty<ref>http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf</ref>==
<blockquote>A theoretical benefit of  
DSM systems is that they scale better
than tightly coupled shared-memory  
multiprocessors. The limits of scalability are greatly reduced by two factors: central bottlenecks (such as the  
bus of a tightly coupled shared-
memory multiprocessor), and global
common knowledge operations and
storage (such as broadcast messages or
full directories, whose sizes are proportional to the number of nodes).  
Li and Hudak2 went through several
iterations to refine a coherence protocol for Ivy before arriving at their dy-
namic distributed-manager algorithm,  
which avoids centralized bottlenecks.</blockquote>


= Performance Concerns =
<blockquote>However, Ivy and most other DSM
systems are currently implemented on
top of Ethernet (itself a centralized
bottleneck), which can support only
about 100 nodes at a time. This limitation is most likely a result of these
systems being research tools rather than
an indication of any real design flaw.
ShivaY is an implementation of DSM
on an Intel iPSCR hypercube, and it
should scale nicely. Nodes in the Dash
system are connected on two meshes.
This implies that the machine should
be expandable, but the Dash prototype is currently limited by its use of a
full bit vector (one bit per node) to
keep track of page replication.
Heterogeneity. At first glance, sharing memory between two machines with
different architectures seems almost
impossible. The machines may not even
use the same representation for basic
data types (integers, floating-point
numbers, and so on). It is a bit easier if
the DSM system is structured as variables or objects in the source language.
Then a DSM compiler can add conversion routines to all accesses to shared
memory. In Agora, memory is structured as objects shared among heterogeneous machines.
Mermaidlo explores another novel
approach: Memory is shared in pages,
and a page can contain only one type of
data. Whenever a page is moved between two architecturally different systems, a conversion routine converts
the data in the page to the appropriate
format. </blockquote>
<blockquote>
Although heterogeneous DSM might
allow more machines to participate in
a computation, the overhead of conversion seems to outweigh the benefits.
Related algorithms.
To support a DSM
system, synchronization operations and
memory management must be specially
tuned. Semaphores, for example, are
typically implemented on shared-
memory systems by using spin locks. In
a DSM system, a spin lock can easily
cause thrashing, because multiple nodes
may heavily access shared data. For
better performance, some systems pro-
vide specialized synchronization primitives along with DSM. Clouds provides
semaphore operations by grouping
semaphores into centrally managed segments. Munin supports the synchroni-
zation memory type with distributed
locks. Plus supplies a variety of syn-
chronization instructions, and supports
delayed execution, in which the syn-
chronization can be initiated, then later
tested for successful completion.
Memory management can be restructured for DSM. A typical memory-
allocation scheme (as in the C library
malloc()) allocates memory out of a
common pool, which is searched each
time a request is made. A linear search
of all shared memory can be expensive.
A better approach is to partition avail-
able memory into private buffers on
each node and allocate memory from
the global buffer space only when the
private buffer is empty.</blockquote>


A DSM machine has unique requirements compared to shared memory / bus based machines in order to provide cache coherence and memory consistency, as well as having interconnects. Performance concerns of each of these concepts in DSM machines are discussed in detail.
=Other Factors=
==Application Programming Interface==
<blockquote>As we know, whether a new software system is competitive depends greatly on the friendship
of it’s application programming interface(API). In parallel computing field, shared memory and
message passing are two mainstream programming model which are widely used in the world.
Shared memory programming model is a natural extension of traditional sequential programming
model so that it is easy to be accepted by the application programmers. In this model, the
application programmer does not need to consider the data partition, migration and underlying communication mechanism. On the contrary, the message passing programming model
requires the application programmer to take the data partition into account, and use system
supplied communication functions explicitly, which burden the application programmer much.
As a result, the shared memory model is widely adopted and advocated by the researchers,
application programmers and corporations, which fuels the success of shared virtual memory
system.</blockquote>
<blockquote>
Generally speaking, the programming model of software DSM system is similar to traditional
shared memory systems, such as Single Program and Multiple Data (SPMD) and Multithreading, which are familiar with many programmers. However, the API of a software DSM system
is closely related to the memory consistency model implemented in the system and the implementation level of the system. For example, in a software DSM system with sequential
consistency, the programmer can write the program just like a general SPMD programming
mode. They can allocate a shared variable at any time and use them easily. However, if the
Entry Consistency is used, the application programmer must mark the affinity between
shared data and synchronization object explicitly in the program.</blockquote>
<blockquote>
The implement level of software DSM system will affect the programming interface too. If
the system is implemented on the language level, such as Linda and Orca, the programmer
must learn those new characteristics in the languages related to shared memory address space.
If the system is implement on the complier or operating system level, all the work are done
by the compiler and all the changes are transparent to application users, as in Shasta[100]. If
the software DSM system is implemented by runtime library, what the application programmer
should do is adding some function calls in the source program, and linking the related library
when compiling.</blockquote>


== Maintaining cache coherence ==
==Memory Organization==
<blockquote>The memory organization of a software DSM system determines the way shared virtual memory
is organized on top of the isolated memory address space. It has great influence on the cache
coherence protocol and consequently on system overhead.</blockquote>
<blockquote>
Normally, the shared virtual memory address space of a software DSM system is organized
in a COMA-like way, where the local memory of a host is treated as a large cache, and pages can
be replicated and migrated on demand. Migrating the owner of data according to the sharing
pattern reduces the probability of access miss for some applications characterized with single
writer reference pattern, but requires a mechanism to locate the data when local access is failed.
In systems that adopt the COMA shared virtual memory organization scheme, each page has
an owner, and mechanisms such as probable owner or approximate copyset are employed to
find where the owner of the faulting page is when the page fault occurs. Examples of this kind
of system includes IVY, Munin, TreadMarks, CVM, and Quarks.</blockquote>
<blockquote>
The shared virtual address space of a software DSM can also be organized like traditional
CC-NUMA. In this kind of system, each shared page has a fixed home, when the page faulting
occurs, the faulting processor can fetch the up-to-date page from the home directly, only one
round-trip communication is needed. This requires that the coherence protocol to propagate
modified information to the home node on time. The main advantage of the home-based protocol is its very low memory overhead compared to homeless counterpart. Besides the simplicity
of servicing ordinary page fault, another advantage of the home-based software DSM system is
that no diffs generation are required for writes to home pages. Evaluation results from Zhou  demonstrate that home-based system have comparable or better performance than
its COMA counterparts, though platform dependent evaluation make the evaluation results not so convincible.
Some other memory organization schemes which stem from COMA and NUMA have been
implemented and studied in DSM systems. The I-ACOMA and simple-COMA are examples of these varieties.
</blockquote>


To maintain correct cache coherence, write propagation and write serialization must be provided, both of which can have adverse effects on performance.
= Performance Improvements =


Write serialization requires that all writes to a memory location be seen in the same order by other processors. Earlier, an example was given indicating how write serialization can be violated by observing writes out of order. A naive implementation of write serialization would require that a request and all it's messages are performed atomically to avoid overlapping of requests [1, p. 338]. Solihin [1, p. 342-344] discusses correctness issues that can occur if requests are allowed to overlap without special consideration.  A non-overlapping approach would require that each request has conditions defined that indicate when it was begun and when it ends, in order for the home node to observe and wait for completion prior to processing other requests to the same block.


The performance concern of disallowing overlapping of requests is that subsequent read or write operations to the same block would be delayed from initiating, even if some of the messages within the requests can be overlapped without correctness concerns.


Another performance problem that can arise through cache coherence are false sharing misses. False sharing can be explained by an example. Suppose two processors have a cache block caches in the shared state, but processor A is reading and writing to a variable x within this block, and processor B is reading and writing to a variable y within this block. Although both processors are not attempting to access each others variables, since both variables map to the same cache block, then each processor either invalidates or sends updates to the other <i>without the other actually needing the data</i>. in a DSM system, these invalidations or updates can unnecessarily utilize a significant amount of interconnect bandwidth.
== Cray X1 Supercomputer ==
<blockquote>The Cray X1 combines the globally-addressable,
distributed shared memory architecture with
high memory and network interconnect bandwidth. In order to sustain high bandwidth vector processing, the X1
is based on previous MPP Cray designs that emphasized
memory bandwidth, as well as more recent vector concepts such as multi-streaming and vector caching. The
system uses a network interconnect reminiscent of the
Cray T3E to connect Cray nodes in order to unite long,
latency-tolerant vector computations with the scalability
to be expected from MPPs.
<center>[[Image:X1arch.gif ]]</center>
<center> '''Figure. Cray X1 architechture''' </center>
Figure above  illustrates the architecture of a single Cray X1
node, the basic building block of the system.
Each MSP contains 4 SSPs
each with 2 vector and 1 scalar unit
consists of four multi-streaming processors (MSPs) and a
flat, shared 16GB physical memory. Each MSP in turn
is composed of four single-streaming processors (SSPs),
each with two vector pipelines and one scalar processor.
The four SSPs also share a 2MB data “E-Cache”, which
helps supply enough memory bandwidth to saturate the
vector units. As is the case with many vector platforms,
applications whose critical paths do not vectorize tend
to exhibit poor performance; in addition to operating at
twice the clock speed, the ability of the vector units to
overlap memory operations with computation makes the
Cray X1’s vector units significantly more powerful than
the scalar pipeline. The X1 offers two configurations for
program execution. Explicit parallelism is achieved in the
SSP mode by treating each SSP as a separate processor,
such that the node essentially behaves as a 16-way SMP.
The alternative MSP mode maps each execution thread
to an MSP, and utilizes compiler-directed multi-streaming
transformations to accomplish automatic parallelization
across the constituent SSP hardware. The multi-streaming
process divides either vectorized inner loops or unvectorized outer loops into four independent segments, and assigns them to different SSPs to be executed in parallel.
An early performance evaluation of the Cray X1 suggests that many parallel applications can achieve significant performance on the machine, given sufficient porting
and optimization efforts.</blockquote>  


== Maintaining memory consistency ==
<blockquote>Fast interconnect between nodes
– for up to 128 nodes, a 4D hypercube
– for more nodes (up to 1024), “enhanced” 3D torus
- Synchronization via atomic memory operations
– for locks, barriers, etc.
- One MSP acts like an SMP,, p but each processor can also
  directly address memory on another MSP
– remote accesses go over the network, and bypass local cache
– use VM address translation to turn 64-bit virtual address into
-  physical address with node number and local 36-bit physical
  address
- remote access not good for single read/write (high latency, no
caching ), but good for block put/get operations (one-sided)
interconnect, compared to other contemporary DSM machines.</blockquote>


Depending on the memory consistency model that is being enforced, performance can be lost be having to ensure various degrees of atomicity of memory accesses and program ordering.
<blockquote>Benchmarks show good performance from high bandwidth and
relatively low latency to local and remote memory, fast
interconnect, compared to other contemporary DSM machines.<ref>http://www.cs.umd.edu/class/spring2010/cmsc714/Lectures/tera-cray-lect10.pdf</ref></blockquote>


For sequential consistency, to follow program order requires that the program executes statements in the order defined by the source code for a thread. The implication is that statements within a thread cannot be executed out of order, so compiler optimizations and processor optimizations that include out of order execution as an attempt to reduce the latency of individual instructions and increase instruction level parallelism on pipelined architectures must be avoided to varying degrees. [1, p. 293]
===Benchmarks and applications Performance<ref>http://www.csm.ornl.gov/~worley/papers/CUG03.WorleyDunigan.X1.pdf</ref>===
<b>HALO</b>
Performance comparison of different MPI protocols for exchange.The optimal implementation for all halo sizes uses a
persistent MPI exchange protocol.


For atomicity of memory accesses to be seen by all processors, special considerations need to be made for DSM systems. In general, all processors must be able to detect when a load or a store has completed. For a store, or write atomicity, on a bus based system, the completion can be assumed as occurring as soon as a read exclusive request reach the bus. This is because all processors on the bus are guaranteed to see the request and will immediately invalidate themselves, resulting in the next read to the location becoming a read miss, requiring the block to be re-cached with the most up to date value. On a DSM system, however, there is no bus, so a write cannot be assumed complete as soon as invalidations are sent on the network. Rather, they must wait until acknowledgements of the invalidations are received which can take many hops and incur high latency, especially if there is network congestion. [1, p.292]
<center>[[Image:X1.jpg|Halo]]</center>
<center> '''Figure.  Halo benchmarks''' </center>


Overall, memory consistency on DSM systems can require latencies while waiting for acknowledgements to see completion of writes and loss of performance from in-order execution.
<b>NAS Benchmarks</b>
Results are given for both an
MPI implementation and a Co-Array Fortran implementation
due to Alan Wallcraft. For the two
IBM systems, results are given for both MPI and
OpenMP implementations.Due to the nature of the memory access patterns
in the hierarchical multigrid algorithm, MG is typically
very sensitive to latency, both interprocess
and memory. From these data, the X1 performance
scales reasonably well, with the only significant
problem at the transition from 4 to 8 processors,
i.e. the transition from using only one SMP node to
using multiple nodes. Co-Array Fortran is not a significant
performance enhancer for this code, problem
size, and number of processors. While percentage of
peak is only 10-16% on the X1, this code was not
modified to improve vectorization or streaming on
the X1.
<center>[[Image:Crayx1.jpg|Cray Parallel Benchmark]]</center>
<center> '''Figure. Cray benchmark''' </center>


Relaxed memory consistency models are techniques normally used to alleviate performance concerns, and are discussed in detail as improvements in specific DSM systems.


== Latency of interconnections ==
<b>Parallel Oceam Program</b>
From these data, POP on the X1 performs well
compared to the nonvector systems, but still lags
slightly behind the performance on the Earth Simulator.
The time spent in the baroclinic process is
nearly identical on the Earth Simulator and the X1.
Performance is better in the barotropic process on
the X1 than on the Earth Simulator, primarily due
to the use of Co-Array Fortran in the conjugate gradient
solver. Much of the X1 version of POP is still
identical to the Earth Simulator version, there is every
indication that X1 performance can be improved.


A mentioned in the cache coherency and memory consistency sections, interconnections unique distinguish DSM systems from bus based systemsInterconnections are unlike a bus in that they do not guarantee that messages reach recipients, and certainly aren't seen by the receivers at the same moment. Each message must be sent as a transaction in a networking protocol, and each packet sent has at least the latency of hops through routers in the network, and can also incur latency in being generated.
<center>[[Image:AppX1.png ]]</center>
<center> '''FigurePOP benchmark''' </center>


Since messages can become ubiquitous if the DSM system was naively designed to perform identically to a bus based system, care must be taken to design coherence protocols and consistency models that minimize the sending and receiving of messages, and if they must be sent or received, allow for overlapping of execution without blocking while waiting for messages or their receipt.
='''References'''=
<references/>

Latest revision as of 03:26, 20 April 2013

NOTE: This is a modified wiki.All references from the Sohlin's book and class notes have been avoided.This is a standalone resource which condenses data from research papers,hence requires previous knowledge of DSM alongwith some idea of performance issues faced by designers.

Introduction

Distributed shared memory received much attention because it offers the power of parallel computing using multiple processors as well as a single system memory view which makes the programming task easy.

We start by classifying DSM's on basis of implementations and characteristics.Software and Hardware DSM's face different performance issues hence we need to figure out the limitations of the system. Memory Consistency,Coherence in a distributed shared memory system are important issues because there might be some potential problems when different processors and caches are used to update shared single memory space. In order to improve performance and get correct result of computation, distributed shared memory systems designers should choose the proper paradigm of memory coherence semantics ,consistency protocols .<ref>http://crystal.uta.edu/~kumar/cse6306/papers/Chingwen.pdf</ref>

Scalability,Granularity come next as these systems consist of a collection of independent computers connected by a high-speed interconnection network. If designers choose the network topology carefully, the system can scale to a large number of nodes.


DSM’s are also concerned with the interconnection network that provide the data to the requesting processor in an efficient and timely fashion. Both the bandwidth (amount of data that can be supplied in a unit time) and latency (the time it takes to receive the first piece of requested data from the time the request isissued) are important to the design of DSM’s. Precisely because of the generally longer latencies encountered in large scale DSM’s, multithreading has received considerable attention; multithreading model can be utilized to tolerate (or mask) memory latencies. <ref>http://web.engr.oregonstate.edu/~benl/Publications/Book_Chapters/Advances_in_Computers_DSM00.pdf</ref>

After looking into SMP model parameters which are intrinsic to DSM architechture we move on to external factors such as API and Memory organization.Major improvements can be achieved if the above parameters are taken into account when designing a Distributed Memory Architecture system.

Implementation of DSM is taken up next and explained in detail alongwith benchmarks.This helps in understanding the benefits of keeping the above discussed parameters in mind when tweaking the designs based on available information.Future work on this can include interrelation among various parameters and the need to come up with a better technique to figure out performance meterics for a DSM based system.


Figure. DSM Designs

DSM Classification<ref>http://www.cs.rit.edu/~pns6910/docs/Distributed%20Shared%20Memory%20Systems/A%20survey%20of%20distributed%20shared%20memory%20systems.pdf</ref>

To provide an overview

of DSM, all possible platforms must be considered in DSM design.The choice relies on classifying all existing systems into appropriate non-overlapping subsets of systems.

DSM implementation level types-

1 Hardware 
2 Software 
 2.1 Operating system 
     2.1.1 Inside the kernel
     2.1.2 Outside the kernel 
 2.2 Runtime library routines 
2.3 Compiler-inserted primitives 
3 Hardware/software combination 

The level of DSM implementation affects both the programming model and the overall system performance. While the

hardware solutions bring total transparency to the programmer, and achieve very low access latencies, software solutions can better exploit the application behavior and represent the

ideal polygon to experiment with new concepts and algorithms.

As the consequence, the number of software DSM systems presented in the open literature is considerably higher, but the

systems intending to become commercial products and standards are mostly hardware-oriented.Architectural configuration of the system affects the system performance, since it can offer or restrict a good potential for parallel processing of requests related to the DSM management. It also strongly affects the scalability. Since a system applying a DSM mechanism is usually organized as a set of clusters

connected by an interconnection network, architectural parameters include:

a) Cluster configuration (single/multiple processors, with/without, shared/private, single/multiple level caches, etc.)
b) Interconnection network (bus hierarchy, ring, mesh, hyper- cube, specific LAN, etc.)

Cluster configuration is usually very important for the hardware-oriented proposals that integrate the mechanisms of cache

coherence on the lower level with the DSM mechanisms on the higher level of the system organization, or even store all shared data in large caches. Cluster configuration is mostly transparent for software solutions, It includes the memory organization and

the placement of directory, as well.

Almost all types of interconnection networks found in multiprocessors and distributed systems have also been used in DSM systems, The majority of software-oriented DSM systems were actually build on the top of Ethernet, although some of the solutions tend to be architecture independent and portable to various platforms. On the other hand, topologies such as bus hierarchy or mesh are typical for hardware solutions. The choice of topology can be also very important for the implementation of DSM algorithm, since it affects the possibility and cost of broad-

cast and multicast transactions.

The impact of organization to the overall system performance is closely related to the locality of data access.

Hardware solutions always deal with non-structured data objects (typically cache blocks), while many software implementations tend to use data items that represent logical entities, in order to take advantage of the locality naturally expressed by the application. On the other hand, some software solutions, based on virtual memory mechanisms, organize data in larger physical

blocks (pages), counting on the coarse-grain sharing.

Software DSM<ref>http://csag.ucsd.edu/individual/yskee/publication/parco04.pdf</ref>

Software distributed shared memory (SDSM) systems which provide shared address space have been of great importance to distributed memory architectures.

Early SDSM systems like IVY [1], Midway [2], Munin [3], and TreadMarks [4] assume uniprocessor nodes, thus allow only one thread per process on a node. Currently, commodity off-the-shelf microprocessors and network components are widely used as building blocks for parallel computers. This trend has made cluster systems consisting of symmetric multiprocessors (SMPs) attractive platforms for high performance computing. However, the early single-threaded SDSM systems are too restricted to exploit multiprocessors in SMP clusters. The next generation SDSM systems like Quarks [5], Brazos [6], DSM-Threads [7], and Murks [8] are aware of multiprocessors and exploit them by means of multiprocesses or multithreads. In general, naive process-based systems experience high context switchingoverhead and additional inter-process communication delay within a node, so the focus is on multi-threaded SDSM systems. Many single-threaded SDSM systems are implemented at user-level by using the

page fault handling mechanisms.


The SDSM system faces a dilemma when multiple threads compete to access an invalid

page within a short interval. On the first access to an invalid page, the system should set the page writable to replace with a valid one. Unfortunately, this change also allows other application threads to access the same page freely. This phenomenon is known as atomic page update and change right problem [7] or a race condition [8]. A known solution to this problem adopted by major multithreaded SDSM systems like TreadMarks [9], Brazos [6], and Strings [10] is to map a file to two different virtual addresses. Even though the file mapping method achieves good performance on some systems, file mapping is not always the best solution. Operating system and working environment may severely affect the performance of these systems. The file mapping method performs poorly in some cases; for example, an IBM SP Night Hawk system .Moreover, file mapping has high initialization cost and reduces the available address space because SDSM and application partition the address space. We note the cause of the atomic page update problem is that SDSM and application share the same address space. When SDSM changes a page writable, the page is also accessible to the application. A general solution to this problem is to separate the application address space from the system address space for the same physical memory, and to assign different access permission to each address space.

Hardware Support<ref>http://www.csl.cornell.edu/~espeight/papers/CSL-TR-2000-1008.pdf</ref>

Scalable cache-coherent distributed shared-memory (DSM) machines have received much attention in the literature since the late 1980s. To demonstrate their effectiveness, several cache-coherent non-uniform memory access (CC-NUMA) hardware DSM machines were built in the research com- munity (e.g. DASH [26], Alewife [2],) and commercial machines followed (e.g. SGI Origin 2000 [23], Sequent NUMA-Q [29], HP Exemplar [1],). At the same time, a large research effort produced a set of scientific benchmarks with which to evaluate DSM machines [48]. Most high-performance hardware DSM machines have tightly-integrated node or memory controllers that connect the microprocessor both to the memory system and to a proprietary high-speed switching network. The scalable coherence protocols used in such machines are implemented either in hardware finite-state machines or in software running on an embedded programmable device in the controller. Despite the resulting high performance of these systems, and efforts to show that the necessary additional hardware to support hardware DSM in commodity workstations and servers is small [25], high-end PC servers and engineering workstations have yet to integrate the additional functionality needed to build seamless hardware DSM from COTS

(commodity of-the-shelf) components.

Performance Concerns

A DSM machine has unique requirements compared to shared memory / bus based machines in order to provide cache coherence and memory consistency, as well as having interconnects. Performance concerns of each of these concepts in DSM machines are discussed in detail.

Cache coherence

The presence of multiple cached copies of a page requires a mechanism to notify other sharers

of a modified memory location. A new value is propagated by either invalidating or updating each copy, which correspond to write-invalidate and write-update coherence protocol respectively. Generally, the coherence protocol consists of the data structures and algorithms used

to implement a given memory consistency model.

A cache coherence protocol is in fact a mechanism of propagating newly written values so that all processors have a coherent view of the shared memory. It is normally designed under the requirement of a memory consistency model which specifies the coherence requirement on a cache coherence protocol, i.e., what “coherent view” of shared memory a cache coherence protocol should provide. Many new ideas accompany with relaxed memory consistency models have been proposed to efficiently propagate newly written values. Examples of these new ideas

include lazy release consistency, and delay consistency etc.

Traditionally, there are two main methods to maintain coherence: snoopy protocol and directory-based protocol. The snoopy coherence protocol requires the support of hardware so it is not widely used in software DSM systems. Up to now, almost all software DSM systems adopt the directory scheme or that similar to directory scheme. However, the scalability of the directory-based scheme is limited by the organization mode of the directory. Therefore, we

propose a new scheme to maintain the coherence between multiple cache copies.

Memory consistency <ref>http://www.cs.wayne.edu/~weisong/papers/shi00-jiajia.pdf</ref>

The Memory consistency model is an interface between the programmer and the system.

The memory consistency model of a shared memory system formally specifies how the memory system will appear to the programmer. Normally, a memory consistency model defines constraints on the order in which memory accesses can be performed in shared memory systems. It influences both programming complexity and performance. The stricter the memory consistency model, the easier for programmer to program, and the less opportunity for optimization. A strict consistency model like sequential consistency is intuitive to the programmer. However, with the large granularity of coherence unit (a page) in shared virtual memory systems, the false sharing problem will be so serious that the performance of software DSM systems under sequential consistency is very poor. For example, the performance of the first software DSM system IVY is so poor that the main contribution of it is the original idea about software DSM, while the practical system is useless. To improve performance, software DSM systems normally adopt relaxed memory consistency models which separate synchronization operations from ordinary load and store operations and allow ordinary operations to be executed out of order. The propagation and application of coherence operations are postponed until synchronization points. In the past decade, almost 20 different memory consistency models have been proposed for hardware-coherent systems. Among those relaxed consistency models, release consistency which separates acquire from release synchronization inspire a major breakthrough in the performance of software DSM

systems.

Although the memory consistency model specifies when coherence operations and data need to become visible, it can actually be implemented with various degrees of “laziness” in the propagation and application of both coherence and data. Greater laziness implies greater complexity and protocol and state, but fewer communication and protocol operations. For example, hardware-coherent systems that use release consistency tend to propagate coherence and apply them immediately, thus simplifying the data structures that need to memorize the

state of the cache line.

In software DSMs, it is very important to reduce the number of messages exchanged, because

sending a message in a software DSM is more expensive than that in a hardware DSM.

The TreadMarks’ lazy implementation of release consistency goes further. It does not propagate the modifications in a critical section at the time of release. Instead, modifications are buffered and are propagated merely to the processor that acquires the released lock until the time of acquire. In this way, lazy release consistency reduces both the number of messages and the amount of data exchanged. In LRC, before a processor can pass an acquire operation, all modifications that have been visible to the releasing processor must also be visible to the

acquiring processor.


Interconnection Latency

A mentioned in the cache coherency and memory consistency sections, interconnections unique distinguish DSM systems from bus based systems. Interconnections are unlike a bus in that they do not guarantee that messages reach recipients, and certainly aren't seen by the receivers at the same moment. Each message must be sent as a transaction in a networking protocol, and each packet sent has at least the latency of hops through routers in the network, and can also incur latency in being generated.

Since messages can become ubiquitous if the DSM system was naively designed to perform identically to a bus based system, care must be taken to design coherence protocols and consistency models that minimize the sending and receiving of messages, and if they must be sent or received, allow for overlapping of execution without blocking while waiting for messages or their receipt.

Granularity<ref>http://www.cdk5.net/dsm/Ed4/Chapter%2018%20DSM.pdf</ref>

An issue that is related to the structure of DSM is the granularity of sharing.

Conceptually, all processes share the entire contents of a DSM. As programs sharing DSM execute, however, only certain parts of the data are actually shared and then only for certain times during the execution. It would clearly be very wasteful for the DSM implementation always to transmit the entire contents of DSM as processes access and update it. What should be the unit of sharing in a DSM implementation? That is, when a process has written to DSM, which data does the DSM runtime send in order to provide

consistent values elsewhere?

The focus here is on page-based implementations, although the granularity issue

does arise in other implementations. In a page-based DSM, the hardware supports alterations to an address space efficiently in units of pages – essentially by the placement of a new page frame pointer in the page table (see, for example, Bacon [2002] for a description of paging). Page sizes can typically range up to 8 kilobytes, so this is an appreciable amount of data that must be transmitted over a network to keep remote copies consistent when an update occurs. By default, the price of the whole page transfer must be paid whether the entire page has been updated, or just

one byte of it.

Using a smaller page size does not necessarily lead

to an improvement in overall performance. First, in cases where processes do update large amounts of contiguous data, it is better to send one large page rather than several smaller pages in separate updates, because of the fixed software overheads per network packet. Second, using a small page as the unit of distribution leads to a large number of units that must be administered separately by the DSM implementation. To complicate matters further, processes tend to contend more for pages when the page size is large, because the likelihood that the data they access will lie within the same page increases with the page size. Consider, for example, two processes, one of which accesses only data item A while the other accesses only data item B, which lie within the same page. For the sake of concreteness, let us assume that one process reads A and the other updates B. There is no contention at the application level. However, the entire page must be transmitted between the processes, since the DSM runtime does not by default know which locations in the page have been altered. This phenomenon is known as false sharing: two or more processes share parts of a page, but only one in fact accesses each part. In write-invalidate protocols, false sharing can lead to unnecessary invalidations. In write-update protocols, when several writers falsely share data items they may cause them to be overwritten with older versions. In practice, the choice of the unit of sharing has to be made based on the physical page sizes available, although a unit of several contiguous pages may be taken if the page size is small. The layout of data with respect to page boundaries is an important factor in determining the number of page transfers made when a program executes.

Thrashing

A potential problem with write-invalidate protocols is thrashing. Thrashing is said to occur where the DSM runtime spends an inordinate amount of time invalidating and transferring shared data compared with the time spent by application processes doing useful work. It occurs when several processes compete for the same data item, or for falsely shared data items. If, for example, one process repeatedly reads a data item that another is regularly updating, then this item will be constantly transferred from the writer and invalidated at the reader. This is an example of a sharing pattern for which write-invalidate is inappropriate and write-update would be better. The next section describes the Mirage approach to thrashing, in which computers ‘own’ pages for a minimum period; Section 18.4 describes how Munin allows the programmer to declare access patterns to the DSM system so that it can choose appropriate update options for each data item and avoid thrashing.

Scalabilty<ref>http://www.cdf.toronto.edu/~csc469h/fall/handouts/nitzberg91.pdf</ref>

A theoretical benefit of

DSM systems is that they scale better than tightly coupled shared-memory multiprocessors. The limits of scalability are greatly reduced by two factors: central bottlenecks (such as the bus of a tightly coupled shared- memory multiprocessor), and global common knowledge operations and storage (such as broadcast messages or full directories, whose sizes are proportional to the number of nodes). Li and Hudak2 went through several iterations to refine a coherence protocol for Ivy before arriving at their dy- namic distributed-manager algorithm,

which avoids centralized bottlenecks.

However, Ivy and most other DSM

systems are currently implemented on top of Ethernet (itself a centralized bottleneck), which can support only about 100 nodes at a time. This limitation is most likely a result of these systems being research tools rather than an indication of any real design flaw. ShivaY is an implementation of DSM on an Intel iPSCR hypercube, and it should scale nicely. Nodes in the Dash system are connected on two meshes. This implies that the machine should be expandable, but the Dash prototype is currently limited by its use of a full bit vector (one bit per node) to keep track of page replication. Heterogeneity. At first glance, sharing memory between two machines with different architectures seems almost impossible. The machines may not even use the same representation for basic data types (integers, floating-point numbers, and so on). It is a bit easier if the DSM system is structured as variables or objects in the source language. Then a DSM compiler can add conversion routines to all accesses to shared memory. In Agora, memory is structured as objects shared among heterogeneous machines. Mermaidlo explores another novel approach: Memory is shared in pages, and a page can contain only one type of data. Whenever a page is moved between two architecturally different systems, a conversion routine converts the data in the page to the appropriate

format.

Although heterogeneous DSM might allow more machines to participate in a computation, the overhead of conversion seems to outweigh the benefits. Related algorithms. To support a DSM system, synchronization operations and memory management must be specially tuned. Semaphores, for example, are typically implemented on shared- memory systems by using spin locks. In a DSM system, a spin lock can easily cause thrashing, because multiple nodes may heavily access shared data. For better performance, some systems pro- vide specialized synchronization primitives along with DSM. Clouds provides semaphore operations by grouping semaphores into centrally managed segments. Munin supports the synchroni- zation memory type with distributed locks. Plus supplies a variety of syn- chronization instructions, and supports delayed execution, in which the syn- chronization can be initiated, then later tested for successful completion. Memory management can be restructured for DSM. A typical memory- allocation scheme (as in the C library malloc()) allocates memory out of a common pool, which is searched each time a request is made. A linear search of all shared memory can be expensive. A better approach is to partition avail- able memory into private buffers on each node and allocate memory from the global buffer space only when the

private buffer is empty.

Other Factors

Application Programming Interface

As we know, whether a new software system is competitive depends greatly on the friendship

of it’s application programming interface(API). In parallel computing field, shared memory and message passing are two mainstream programming model which are widely used in the world. Shared memory programming model is a natural extension of traditional sequential programming model so that it is easy to be accepted by the application programmers. In this model, the application programmer does not need to consider the data partition, migration and underlying communication mechanism. On the contrary, the message passing programming model requires the application programmer to take the data partition into account, and use system supplied communication functions explicitly, which burden the application programmer much. As a result, the shared memory model is widely adopted and advocated by the researchers, application programmers and corporations, which fuels the success of shared virtual memory

system.

Generally speaking, the programming model of software DSM system is similar to traditional shared memory systems, such as Single Program and Multiple Data (SPMD) and Multithreading, which are familiar with many programmers. However, the API of a software DSM system is closely related to the memory consistency model implemented in the system and the implementation level of the system. For example, in a software DSM system with sequential consistency, the programmer can write the program just like a general SPMD programming mode. They can allocate a shared variable at any time and use them easily. However, if the Entry Consistency is used, the application programmer must mark the affinity between

shared data and synchronization object explicitly in the program.

The implement level of software DSM system will affect the programming interface too. If the system is implemented on the language level, such as Linda and Orca, the programmer must learn those new characteristics in the languages related to shared memory address space. If the system is implement on the complier or operating system level, all the work are done by the compiler and all the changes are transparent to application users, as in Shasta[100]. If the software DSM system is implemented by runtime library, what the application programmer should do is adding some function calls in the source program, and linking the related library

when compiling.

Memory Organization

The memory organization of a software DSM system determines the way shared virtual memory

is organized on top of the isolated memory address space. It has great influence on the cache

coherence protocol and consequently on system overhead.

Normally, the shared virtual memory address space of a software DSM system is organized in a COMA-like way, where the local memory of a host is treated as a large cache, and pages can be replicated and migrated on demand. Migrating the owner of data according to the sharing pattern reduces the probability of access miss for some applications characterized with single writer reference pattern, but requires a mechanism to locate the data when local access is failed. In systems that adopt the COMA shared virtual memory organization scheme, each page has an owner, and mechanisms such as probable owner or approximate copyset are employed to find where the owner of the faulting page is when the page fault occurs. Examples of this kind

of system includes IVY, Munin, TreadMarks, CVM, and Quarks.

The shared virtual address space of a software DSM can also be organized like traditional CC-NUMA. In this kind of system, each shared page has a fixed home, when the page faulting occurs, the faulting processor can fetch the up-to-date page from the home directly, only one round-trip communication is needed. This requires that the coherence protocol to propagate modified information to the home node on time. The main advantage of the home-based protocol is its very low memory overhead compared to homeless counterpart. Besides the simplicity of servicing ordinary page fault, another advantage of the home-based software DSM system is that no diffs generation are required for writes to home pages. Evaluation results from Zhou demonstrate that home-based system have comparable or better performance than its COMA counterparts, though platform dependent evaluation make the evaluation results not so convincible. Some other memory organization schemes which stem from COMA and NUMA have been implemented and studied in DSM systems. The I-ACOMA and simple-COMA are examples of these varieties.

Performance Improvements

Cray X1 Supercomputer

The Cray X1 combines the globally-addressable,

distributed shared memory architecture with high memory and network interconnect bandwidth. In order to sustain high bandwidth vector processing, the X1 is based on previous MPP Cray designs that emphasized memory bandwidth, as well as more recent vector concepts such as multi-streaming and vector caching. The system uses a network interconnect reminiscent of the Cray T3E to connect Cray nodes in order to unite long, latency-tolerant vector computations with the scalability to be expected from MPPs.

Figure. Cray X1 architechture

Figure above illustrates the architecture of a single Cray X1 node, the basic building block of the system. Each MSP contains 4 SSPs each with 2 vector and 1 scalar unit consists of four multi-streaming processors (MSPs) and a flat, shared 16GB physical memory. Each MSP in turn is composed of four single-streaming processors (SSPs), each with two vector pipelines and one scalar processor. The four SSPs also share a 2MB data “E-Cache”, which helps supply enough memory bandwidth to saturate the vector units. As is the case with many vector platforms, applications whose critical paths do not vectorize tend to exhibit poor performance; in addition to operating at twice the clock speed, the ability of the vector units to overlap memory operations with computation makes the Cray X1’s vector units significantly more powerful than the scalar pipeline. The X1 offers two configurations for program execution. Explicit parallelism is achieved in the SSP mode by treating each SSP as a separate processor, such that the node essentially behaves as a 16-way SMP. The alternative MSP mode maps each execution thread to an MSP, and utilizes compiler-directed multi-streaming transformations to accomplish automatic parallelization across the constituent SSP hardware. The multi-streaming process divides either vectorized inner loops or unvectorized outer loops into four independent segments, and assigns them to different SSPs to be executed in parallel. An early performance evaluation of the Cray X1 suggests that many parallel applications can achieve significant performance on the machine, given sufficient porting

and optimization efforts.

Fast interconnect between nodes

– for up to 128 nodes, a 4D hypercube – for more nodes (up to 1024), “enhanced” 3D torus - Synchronization via atomic memory operations – for locks, barriers, etc. - One MSP acts like an SMP,, p but each processor can also directly address memory on another MSP – remote accesses go over the network, and bypass local cache – use VM address translation to turn 64-bit virtual address into - physical address with node number and local 36-bit physical address - remote access not good for single read/write (high latency, no caching ), but good for block put/get operations (one-sided)

interconnect, compared to other contemporary DSM machines.

Benchmarks show good performance from high bandwidth and

relatively low latency to local and remote memory, fast

interconnect, compared to other contemporary DSM machines.<ref>http://www.cs.umd.edu/class/spring2010/cmsc714/Lectures/tera-cray-lect10.pdf</ref>

Benchmarks and applications Performance<ref>http://www.csm.ornl.gov/~worley/papers/CUG03.WorleyDunigan.X1.pdf</ref>

HALO Performance comparison of different MPI protocols for exchange.The optimal implementation for all halo sizes uses a persistent MPI exchange protocol.

Halo
Figure. Halo benchmarks

NAS Benchmarks Results are given for both an MPI implementation and a Co-Array Fortran implementation due to Alan Wallcraft. For the two IBM systems, results are given for both MPI and OpenMP implementations.Due to the nature of the memory access patterns in the hierarchical multigrid algorithm, MG is typically very sensitive to latency, both interprocess and memory. From these data, the X1 performance scales reasonably well, with the only significant problem at the transition from 4 to 8 processors, i.e. the transition from using only one SMP node to using multiple nodes. Co-Array Fortran is not a significant performance enhancer for this code, problem size, and number of processors. While percentage of peak is only 10-16% on the X1, this code was not modified to improve vectorization or streaming on the X1.

Cray Parallel Benchmark
Figure. Cray benchmark


Parallel Oceam Program From these data, POP on the X1 performs well compared to the nonvector systems, but still lags slightly behind the performance on the Earth Simulator. The time spent in the baroclinic process is nearly identical on the Earth Simulator and the X1. Performance is better in the barotropic process on the X1 than on the Earth Simulator, primarily due to the use of Co-Array Fortran in the conjugate gradient solver. Much of the X1 version of POP is still identical to the Earth Simulator version, there is every indication that X1 performance can be improved.

Figure. POP benchmark

References

<references/>