CSC/ECE 506 Spring 2013/11a lh: Difference between revisions
(Created page with "=Introduction= When dealing with a relatively small number of processors (8-16), according to Solihin 320, using a bus based shared memory structure is fine. Unfortunately, when...") |
|||
(31 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
*[https://docs.google.com/document/d/1Vo3GVW0DuRDtno2OBIcWUOGwjTOtjH-DNVExRMWKCtk/edit Write-up of This Topic.] | |||
*Related Articles: | |||
**[http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_506_Spring_2012/11a_ht ECE_506_Spring_2012/11a_ht] | |||
**[http://wiki.expertiza.ncsu.edu/index.php/ECE506_CSC/ECE_506_Spring_2012/11a_az ECE506 CSC/ECE 506 Spring 2012/11a az] | |||
=Introduction= | =Introduction= | ||
When dealing with a relatively small number of processors (8-16), according to Solihin 320, using a bus based shared memory structure is fine. Unfortunately, when you need to provide a shared memory structure for processors much greater than that, you will need a different set of organization. This new organization is needed due to the physical limitations of the bus. There are two ways you can create such a system. These include [http://en.wikipedia.org/wiki/Distributed_shared_memory Distributed Shared Memory] (DSM) or [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non-Uniform Memory Access] (NUMA). The benefits of having a DSM and NUMA is that we can now scale to a larger amount of processors. The disadvantage is that scaling in such a way may not be the most cost-effective solution, Solihin 320. For the remainder of this section, we will be discussing the performance of DSM's. | When dealing with a relatively small number of processors (8-16), according to Solihin 320, using a bus based shared memory structure is fine. Unfortunately, when you need to provide a shared memory structure for processors much greater than that, you will need a different set of organization. This new organization is needed due to the physical limitations of the bus. There are two ways you can create such a system. These include [http://en.wikipedia.org/wiki/Distributed_shared_memory Distributed Shared Memory] (DSM) or [http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access Non-Uniform Memory Access] (NUMA). The benefits of having a DSM and NUMA is that we can now scale to a larger amount of processors. The disadvantage is that scaling in such a way may not be the most cost-effective solution, Solihin 320. For the remainder of this section, we will be discussing the performance of DSM's. | ||
Line 12: | Line 16: | ||
From the table, we can see that there is three ways to scale a multiprocessor system. The first being a single bus system. This is the least scalable due to the limitations of the bus wire itself. As you add processors you will decrease the bus speed due to having to increase the wire length. Also, you run into an issue of overwhelming the bus due to the amount of traffic. The second way is to use a point-to-point bus system. This allows for the speed of the bus to remain relatively fast, but since the traffic will also scale with the number of processors, there will be a limitation due to overwhelming the bus system with traffic. Lastly, the most scalable system to date is using a directory system. This allows for the bus to remain fast due to the short wires, and the bus traffic to remain low since the directory holds information on cache locations. | From the table, we can see that there is three ways to scale a multiprocessor system. The first being a single bus system. This is the least scalable due to the limitations of the bus wire itself. As you add processors you will decrease the bus speed due to having to increase the wire length. Also, you run into an issue of overwhelming the bus due to the amount of traffic. The second way is to use a point-to-point bus system. This allows for the speed of the bus to remain relatively fast, but since the traffic will also scale with the number of processors, there will be a limitation due to overwhelming the bus system with traffic. Lastly, the most scalable system to date is using a directory system. This allows for the bus to remain fast due to the short wires, and the bus traffic to remain low since the directory holds information on cache locations. | ||
= | = Performance Concerns = | ||
A DSM machine has unique requirements compared to shared memory / bus based machines in order to provide cache coherence and memory consistency, as well as having interconnects. Performance concerns of each of these concepts in DSM machines are discussed in detail. | |||
== Maintaining cache coherence == | |||
To maintain correct cache coherence, write propagation and write serialization must be provided, both of which can have adverse effects on performance. | |||
Write serialization requires that all writes to a memory location be seen in the same order by other processors. Earlier, an example was given indicating how write serialization can be violated by observing writes out of order. A naive implementation of write serialization would require that a request and all it's messages are performed atomically to avoid overlapping of requests [1, p. 338]. Solihin [1, p. 342-344] discusses correctness issues that can occur if requests are allowed to overlap without special consideration. A non-overlapping approach would require that each request has conditions defined that indicate when it was begun and when it ends, in order for the home node to observe and wait for completion prior to processing other requests to the same block. | |||
The performance concern of disallowing overlapping of requests is that subsequent read or write operations to the same block would be delayed from initiating, even if some of the messages within the requests can be overlapped without correctness concerns. | |||
Another performance problem that can arise through cache coherence are false sharing misses. False sharing can be explained by an example. Suppose two processors have a cache block caches in the shared state, but processor A is reading and writing to a variable x within this block, and processor B is reading and writing to a variable y within this block. Although both processors are not attempting to access each others variables, since both variables map to the same cache block, then each processor either invalidates or sends updates to the other <i>without the other actually needing the data</i>. in a DSM system, these invalidations or updates can unnecessarily utilize a significant amount of interconnect bandwidth. | |||
== Maintaining memory consistency == | |||
Depending on the memory consistency model that is being enforced, performance can be lost be having to ensure various degrees of atomicity of memory accesses and program ordering. | |||
For sequential consistency, to follow program order requires that the program executes statements in the order defined by the source code for a thread. The implication is that statements within a thread cannot be executed out of order, so compiler optimizations and processor optimizations that include out of order execution as an attempt to reduce the latency of individual instructions and increase instruction level parallelism on pipelined architectures must be avoided to varying degrees. [1, p. 293] | |||
For atomicity of memory accesses to be seen by all processors, special considerations need to be made for DSM systems. In general, all processors must be able to detect when a load or a store has completed. For a store, or write atomicity, on a bus based system, the completion can be assumed as occurring as soon as a read exclusive request reach the bus. This is because all processors on the bus are guaranteed to see the request and will immediately invalidate themselves, resulting in the next read to the location becoming a read miss, requiring the block to be re-cached with the most up to date value. On a DSM system, however, there is no bus, so a write cannot be assumed complete as soon as invalidations are sent on the network. Rather, they must wait until acknowledgements of the invalidations are received which can take many hops and incur high latency, especially if there is network congestion. [1, p.292] | |||
Overall, memory consistency on DSM systems can require latencies while waiting for acknowledgements to see completion of writes and loss of performance from in-order execution. | |||
Relaxed memory consistency models are techniques normally used to alleviate performance concerns, and are discussed in detail as improvements in specific DSM systems. | |||
== Latency of interconnections == | |||
A mentioned in the cache coherency and memory consistency sections, interconnections unique distinguish DSM systems from bus based systems. Interconnections are unlike a bus in that they do not guarantee that messages reach recipients, and certainly aren't seen by the receivers at the same moment. Each message must be sent as a transaction in a networking protocol, and each packet sent has at least the latency of hops through routers in the network, and can also incur latency in being generated. | |||
Since messages can become ubiquitous if the DSM system was naively designed to perform identically to a bus based system, care must be taken to design coherence protocols and consistency models that minimize the sending and receiving of messages, and if they must be sent or received, allow for overlapping of execution without blocking while waiting for messages or their receipt. | |||
=Implementations of Optimizations= | |||
Architectures and programming models have been created that allow DSM machines to alleviate some of the performance issues inherent in DSM vs. bus based systems while also avoiding the overheads of maintaining cache coherence and memory consistency, and the latency of interconnects. | |||
==Shasta Protocol== | ==Shasta Protocol== | ||
Line 78: | Line 79: | ||
The Shasta protocol allows the application to explicitly specify the home processor for individual pages instead of relying on the default round-robin allocation. The protocol also supports non-binding prefetch and prefetch-exclusive directives. The Shasta system can optionally supply information on source code lines that suffer the most number of remote misses by keeping extra state within the protocol. The programmer can use this information to identify places where prefetching may be helpful. | The Shasta protocol allows the application to explicitly specify the home processor for individual pages instead of relying on the default round-robin allocation. The protocol also supports non-binding prefetch and prefetch-exclusive directives. The Shasta system can optionally supply information on source code lines that suffer the most number of remote misses by keeping extra state within the protocol. The programmer can use this information to identify places where prefetching may be helpful. | ||
<br><br> | <br><br> | ||
===Performance=== | |||
<center>[[Image:Shasta-overhead.png]]</center> | |||
<center>[[Image:Shasta-overhead-graph.png]]</center> | |||
<center> '''Sequential times and checking overheads for the SPLASH-2 applications.''' <ref name="shasta protocol report"></ref></center> | |||
<center>[[Image:Shasta-performance.png]]</center> | |||
<center> '''Speedups of SPLASH-2 applications running with 64-byte block size.''' <ref name="shasta protocol report"></ref></center> | |||
==Parallel File Input/Output in Cohesion system== | ==Parallel File Input/Output in Cohesion system== | ||
Line 141: | Line 151: | ||
For this example, we can see "...four MSPs, 16 memory controller chips (M-chips), and 32 memory daughter cards form a Cray X1 node."<ref name="dunigan"></ref> The memory banks provide 204 GB/s of bandwith which is enough for the processors and streams. Also, these banks use ECC memory which has built in correcting of single-bit errors and detection of multiple-bit errors as well as providing chip-kill error detection. These modules are connected using X1 routing modules. Each node uses full duplex links and each memory module has an even and odd 64-bit link. The overall topology is a 4-D hypercube, though larger configurations move towards a 2-D torus. Since this is a hardware supported DSM, there is little more to discuss regarding the system, so we will move forward into evaluating the performance. | For this example, we can see "...four MSPs, 16 memory controller chips (M-chips), and 32 memory daughter cards form a Cray X1 node."<ref name="dunigan"></ref> The memory banks provide 204 GB/s of bandwith which is enough for the processors and streams. Also, these banks use ECC memory which has built in correcting of single-bit errors and detection of multiple-bit errors as well as providing chip-kill error detection. These modules are connected using X1 routing modules. Each node uses full duplex links and each memory module has an even and odd 64-bit link. The overall topology is a 4-D hypercube, though larger configurations move towards a 2-D torus. Since this is a hardware supported DSM, there is little more to discuss regarding the system, so we will move forward into evaluating the performance. | ||
===Performance=== | ===Performance Improvements=== | ||
==== Interconnect Latency ==== | |||
The first area of performance improvements that this architecture realizes is in reducing the impact of the latency of interconnections. In a traditional DSM system, remote memory accesses would be passed over an interconnect and would suffer the asymmetric latency of NUMA. In the X1, programs can configure and dictate how nodes are used <ref name="cray x11 optimization"> "Optimizing Applications on the Cray X1TM System", | |||
[http://docs.cray.com/books/S-2315-52/html-S-2315-52/S-2315-52-toc.html webpage]</ref> | |||
In one configuration, an application executes within a node. The result is that it executes as if it were on an SMP system with all accesses to memory being local and thus being uniform latency, avoiding the latency of the interconnections to other nodes <ref name="cray x11 optimization"></ref>. This configuration allows the application programmer to use APIs such as OpenMP and utilize shared memory programming models within a node. Since this configuration limits the resources to four MSPs, the programs must be capable of achieving desired performance in four MSPs or, in the context of programs the require more than one node (using message passing with MPI across nodes), must have non-overlapping of memory accesses between nodes. Most scientific applications exhibit little data sharing, but those that do require data sharing see a performance improvement from running within a node and sharing the memory at the local node level through a simple node configuration. <ref name="cray x11 optimization"></ref> | |||
In another configuration, an application execution spans multiple nodes. Through this configuration, all address translations for remote nodes are handled through the TLBs emulating RTTs, but sacrifices some of the performance and gains extra flexibility over the execution within a node. <ref name="cray x11 optimization"></ref> | |||
Understanding the X1 system, we see that the X1 node supports a cache-coherent shared memory and the Cray supports OpenMP, System V shared memory, and POSIX threads shared memory programming models <ref name="dunigan"></ref>. The Cray system also supports several DSM models. Donigan provides many benchmark graphs to show that the performance of the X1 is much greater than previous systems. Please refer to the paper for these graphs <ref name="dunigan"></ref> | Understanding the X1 system, we see that the X1 node supports a cache-coherent shared memory and the Cray supports OpenMP, System V shared memory, and POSIX threads shared memory programming models <ref name="dunigan"></ref>. The Cray system also supports several DSM models. Donigan provides many benchmark graphs to show that the performance of the X1 is much greater than previous systems. Please refer to the paper for these graphs <ref name="dunigan"></ref> | ||
<br> | |||
< | ==== Cache Coherence ==== | ||
Cache coherence is improved on by enabling a larger amount of cache hits and reducing cache misses for the types of applications that would tend to be executed on the Cray X1. | |||
One method of improving results based on cache coherence is through vectorizing calculations to minimize latency. Since most scientific applications will involve floating point operations on data that don't have many dependences, then these applications can have their performance improved by the 32 element vector processors within the SSPs. In order to achieve high performance for vectorized programs, it is important to keep the vector pipeline full <ref name="cray x11 optimization"></ref>. | |||
To aid in this, the Cray Fortran and C compilers will automatically vectorize loops as much as possible if they can determine that there are no data dependences and the calculations fit within an MSP. If the compiler produces suboptimal results because it is using static code analysis, then it is possible, through algorithm analysis, for the application programmer to utilize compiler directives to force a different optimization (including preferring vectorization for one loop over another) to improve performance from vectorization and maximizing cache hits. <ref name="cray x11 optimization"></ref>. More detail on these techniques can be found in the <ref name="cray x11 optimization"></ref>. | |||
=== Performance === | |||
==== Climate Modeling ==== | |||
The Parallel Ocean Program is an ocean modeling code developed at Los Alamos National Laboratory that is used as the ocean component in the Community System Climate Model coupled climate model. <ref name="dunigan"></ref> | |||
<center>[[Image:Crayx1-ocean.png]]</center> | |||
<center>Performance of the LANL Parallel Ocean Program (POP 1.4.3). <ref name="dunigan"></ref></center> | |||
==== Fusion Simulation ==== | |||
GYRO is an Eulerian gyrokinetic-Maxwell solver developed by R.E. Waltz and J. Candy at General Atomics. It is used to study plasma microturbulence in fusion research.<ref name="dunigan"></ref> | |||
<center>[[Image:Crayx1-gyro-performance.png]]</center> | |||
<center>Performance of the GYRO Eulerian GyrokineticMaxwell Solver (64-mode GTC benchmark). <ref name="dunigan"></ref></center> | |||
<center>[[Image:Crayx1-gyro-communication.png]]</center> | |||
<center>Communication Fraction for the GYRO Eulerian Gyrokinetic-Maxwell Solver (64-mode GTC benchmark). <ref name="dunigan"></ref></center> | |||
=Glossary= | =Glossary= |
Latest revision as of 22:20, 16 April 2013
- Write-up of This Topic.
- Related Articles:
Introduction
When dealing with a relatively small number of processors (8-16), according to Solihin 320, using a bus based shared memory structure is fine. Unfortunately, when you need to provide a shared memory structure for processors much greater than that, you will need a different set of organization. This new organization is needed due to the physical limitations of the bus. There are two ways you can create such a system. These include Distributed Shared Memory (DSM) or Non-Uniform Memory Access (NUMA). The benefits of having a DSM and NUMA is that we can now scale to a larger amount of processors. The disadvantage is that scaling in such a way may not be the most cost-effective solution, Solihin 320. For the remainder of this section, we will be discussing the performance of DSM's.
According to Solihin 320, there are two aspects that restrict the scalability of bus-based multiprocessors. These include the physical limitations of interconnections and the limitations of the protocol. To explain in detail, on a bus-based system, adding a processor will not affect any other physical restrictions on the system. Unfortunately, when adding a new processor, you will be reducing the speed of the bus. Second, the protocol needed to keep coherence does not scale well. As you increase the number of processors to the system, the amount of traffic also increases. This means that you might run the risk of overwhelming the bandwith. According to Solihin, there are a few ways that we can mitigate this problem. The following is from 321 of the Solihin textbook.
From the table, we can see that there is three ways to scale a multiprocessor system. The first being a single bus system. This is the least scalable due to the limitations of the bus wire itself. As you add processors you will decrease the bus speed due to having to increase the wire length. Also, you run into an issue of overwhelming the bus due to the amount of traffic. The second way is to use a point-to-point bus system. This allows for the speed of the bus to remain relatively fast, but since the traffic will also scale with the number of processors, there will be a limitation due to overwhelming the bus system with traffic. Lastly, the most scalable system to date is using a directory system. This allows for the bus to remain fast due to the short wires, and the bus traffic to remain low since the directory holds information on cache locations.
Performance Concerns
A DSM machine has unique requirements compared to shared memory / bus based machines in order to provide cache coherence and memory consistency, as well as having interconnects. Performance concerns of each of these concepts in DSM machines are discussed in detail.
Maintaining cache coherence
To maintain correct cache coherence, write propagation and write serialization must be provided, both of which can have adverse effects on performance.
Write serialization requires that all writes to a memory location be seen in the same order by other processors. Earlier, an example was given indicating how write serialization can be violated by observing writes out of order. A naive implementation of write serialization would require that a request and all it's messages are performed atomically to avoid overlapping of requests [1, p. 338]. Solihin [1, p. 342-344] discusses correctness issues that can occur if requests are allowed to overlap without special consideration. A non-overlapping approach would require that each request has conditions defined that indicate when it was begun and when it ends, in order for the home node to observe and wait for completion prior to processing other requests to the same block.
The performance concern of disallowing overlapping of requests is that subsequent read or write operations to the same block would be delayed from initiating, even if some of the messages within the requests can be overlapped without correctness concerns.
Another performance problem that can arise through cache coherence are false sharing misses. False sharing can be explained by an example. Suppose two processors have a cache block caches in the shared state, but processor A is reading and writing to a variable x within this block, and processor B is reading and writing to a variable y within this block. Although both processors are not attempting to access each others variables, since both variables map to the same cache block, then each processor either invalidates or sends updates to the other without the other actually needing the data. in a DSM system, these invalidations or updates can unnecessarily utilize a significant amount of interconnect bandwidth.
Maintaining memory consistency
Depending on the memory consistency model that is being enforced, performance can be lost be having to ensure various degrees of atomicity of memory accesses and program ordering.
For sequential consistency, to follow program order requires that the program executes statements in the order defined by the source code for a thread. The implication is that statements within a thread cannot be executed out of order, so compiler optimizations and processor optimizations that include out of order execution as an attempt to reduce the latency of individual instructions and increase instruction level parallelism on pipelined architectures must be avoided to varying degrees. [1, p. 293]
For atomicity of memory accesses to be seen by all processors, special considerations need to be made for DSM systems. In general, all processors must be able to detect when a load or a store has completed. For a store, or write atomicity, on a bus based system, the completion can be assumed as occurring as soon as a read exclusive request reach the bus. This is because all processors on the bus are guaranteed to see the request and will immediately invalidate themselves, resulting in the next read to the location becoming a read miss, requiring the block to be re-cached with the most up to date value. On a DSM system, however, there is no bus, so a write cannot be assumed complete as soon as invalidations are sent on the network. Rather, they must wait until acknowledgements of the invalidations are received which can take many hops and incur high latency, especially if there is network congestion. [1, p.292]
Overall, memory consistency on DSM systems can require latencies while waiting for acknowledgements to see completion of writes and loss of performance from in-order execution.
Relaxed memory consistency models are techniques normally used to alleviate performance concerns, and are discussed in detail as improvements in specific DSM systems.
Latency of interconnections
A mentioned in the cache coherency and memory consistency sections, interconnections unique distinguish DSM systems from bus based systems. Interconnections are unlike a bus in that they do not guarantee that messages reach recipients, and certainly aren't seen by the receivers at the same moment. Each message must be sent as a transaction in a networking protocol, and each packet sent has at least the latency of hops through routers in the network, and can also incur latency in being generated.
Since messages can become ubiquitous if the DSM system was naively designed to perform identically to a bus based system, care must be taken to design coherence protocols and consistency models that minimize the sending and receiving of messages, and if they must be sent or received, allow for overlapping of execution without blocking while waiting for messages or their receipt.
Implementations of Optimizations
Architectures and programming models have been created that allow DSM machines to alleviate some of the performance issues inherent in DSM vs. bus based systems while also avoiding the overheads of maintaining cache coherence and memory consistency, and the latency of interconnects.
Shasta Protocol
The Shasta protocol improves on the regular software disributed shared memory system such that the shared data can be kept coherent at fine granularity. This is implemented by inserting inline code that checkes the cache state of shared data before each load or store. This protocol also allows the coherence granularity to be varied across different shared data structures in a single application, thus alleviating any potential inefficiencies that arise from the fixed large(page-size) granularity of the communication typical in most software shared memory systems. As in the hardware cache-coherent multiprocessors, shared data in this systems has three states - Invalid, Shared and Exclusive. This protocol provides a number of mechanisms for dealing with the long communication latencies in a workstation cluster. It minimizes extraneous coherence messages, and hence requires fewer messages to satisfy shared memory operations compared to protocols commonly used in hardware DSM systems. It also includes optimizations such as non-blocking stores, that aggressively exploit a relaxed memory consistency model. Other optimizations include detection of migratory data sharing, issuing multiple load misses simultaneously, merging of load and store misses to the same cache line, and support for prefetching and home placement directives. <ref name="shasta protocol report">Scales, D.J.; Gharachorloo, K.; , "Performance of the Shasta Distributed Shared Memory Protocol" WRL Research Report 97/2 , February 1997 report</ref>
Multiple Coherence Granularity
The Shasta protocol can support multiple granularities for communication and coherence, even with a single application. This can provide a significant performance boost in a software DSM system, since data with good spatial locality can be communicated at a coarse grain to amortize large communication overheads , while data prone to false sharing can use a finer sharing granularity. The implementation automatically chooses a block size based on the allocated size of a data structure. The basic heuristic is to choose a block size equal to the object size up to a certain threshold; the block size for objects larger than a given threshold is simply set to the base Shasta line size (typically set to be 64 bytes). The rationale for the heuristic is that small objects should be transferred as a single unit, while larger objects (e.g. large arrays) should be communicated at a fine granularity to avoid false sharing. Since the choice of the block size does not affect the correctness of the program, the programmer can freely experiment with various block sizes to tune the performance of an application. Controlling the coherence granularity in this manner is significantly simpler than approaches adopted by object-based or region-based DSM systems, since the latter approaches can affect correctness and typically require a more substantial change to the application.
Minimization of Protocol Messages
The Shasta protocol is designed to minimize extraneous coherence messages, given the relatively high overheads associated with handling messages in software DSM implementations. In this protocol, the current owner node specified by the directory guarantees to service a request that is forwarded to it. The fact that the current owner guarantees to service a request that is forwarded to it allows the protocol to complete all directory state changes when a requestor first reaches the home. This property eliminates the need for extra messages that are sent back to the home to confirm that the forwareded request is satisfied. Since this protocol supports dirty sharing, it also eliminates the need for sending an up-to-date copy of the line back to the home in case of read transaction when the home node is remote and the data is dirty in another node. Supporting exclusive requests reduces the need for fetching data on a store if the requesting processor already has the line in shared state. Also, the number of invalidation acknowledgements that are expected for an exclusive request are piggybacked on on of the invalidation acknowledgements to the requestor instead of being sent as a separate message.
Exploitation of Relaxed Memory Models
This protocol emulates the behavior of a processor with non-blocking loads and stores and a lockup-free cache, thus exploiting release consistency. Non-blocking stores are supported by issuing a read-exclusive or exclusive request, recording where the stre occured, and continuing. This information allows the protocol to appropriately merge the reply data with the newly written data that is already in memory. This protocol also exhibits a limited form of non-blocking load behavior
due to the batching optimization, since batching can lead to multiple outstanding loads . Lockup-free behavior for lines that are in a pending state is also supported by allowing writes to a pending line to proceed by storing the newly written data into memory and recording the location of the stores in the miss handler invoked due to the pending state.
Detection of Migratory Sharing Patterns
The Shasta protocol provides a sophisticated mechanism for detecting data that is shared in a migratory fashion and optimizing accesses to such data. Migratory sharing occurs when data is read and modified by different processors, leading to the migration of the data from one processor to another. By keeping extra information at each directory entry, the protocol detects whether the data in each line exhibits migratory behavior. A line is designated for migratory conversion after the migratory sharing pattern is successfully observed for a threshold number of times. A read request to a line that is designated for migratory conversion is automatically converted to a read-exclusive request at the directory. This conversion avoids the load miss followed by a store miss to the same line that is typical for migratory shared data. The protocol provides a mechanism to revert a line from migratory conversion. The reply data for a converted read request is cached with a special caching state , which is called exclusive migratory. Operations by the owner processor treat the line as exclusive, and a subsequent store by that processor changes the line to the ordinary exclusive state. The protocol detects a break in the migratory behavior if an incoming request from another processor arrives before the owner processor writes to the line ,i.e., while line is still in exclusive-migratory state. In this case, a message is sent to the home directory to nullify or revert the migratory conversion for that line. The line may subsequently be designated for migratory conversion if migratory behavior is observed again.
Prefetching and Home Placement Directives
The Shasta protocol allows the application to explicitly specify the home processor for individual pages instead of relying on the default round-robin allocation. The protocol also supports non-binding prefetch and prefetch-exclusive directives. The Shasta system can optionally supply information on source code lines that suffer the most number of remote misses by keeping extra state within the protocol. The programmer can use this information to identify places where prefetching may be helpful.
Performance
Parallel File Input/Output in Cohesion system
A parallel file I/O system that is independent of the memory consistency models reduces the file-related network traffic in the page-based software DSM systems built on a network of workstations. File accesses in page-based software Distributed Shared Memory (DSM) systems are usually performed by a single node, which may lead to a poor overall performance because a large amount of network traffic is generated to transfer data between this file handling node and the other nodes. This system alleviates this problem. The two main features in this design are the adaptive data distribution scheme and the delayed file access mechanism. The former distributes file blocks among the nodes according to the access pattern of the application, while the latter ensures that the data are transferred to the consumer node instead of the request node by exploiting the memory mapping features of the virtual shared address space of the DSM systems. Previously, most researches had been focused on the speedup of the computation phase of a DSM application via relaxed memory consistency models. The time-consuming file access operations were neglected. For ease of programming, the file accesses in a DSM application are usually performed in the sequentially executed initialization and completion phases on a certain node. As a result, a large amount of network traffic is generated to distribute the file data to other nodes during computation. This may degrade system performance since the networks employed in many DSM systems are usually slow compared to other components such as CPU, or even disk. The parallel file I/O system helps combat this network traffic. The effectiveness of this design has been verified by implementing a prototype on a page-based software DSM system named Cohesion. <ref name="parallel file system">Ce-Kuen Shieh; Su Cheong Mac; Jyh-Chang Ueng, "Improving the Performance of Distributed Shared Memory Systems via Parallel File Input/Output" The Journal of Systems and Software 44(1998) 3-15 , August 1997
paper</ref>
Design
The two main goals in the design are to parallelize the file accesses and reduce network traffic, and to minimize modifications on the existing DSMapplications. In this parallel file I/O system, every computation node also acts as a storage node. A parallel file is partitioned into file blocks, which are scattered among the nodes so that each node has a disjoint portion of the parallel file. In the pre-run of an application, the files are placed on the root node(node 0), which is the node that initiates the DSM system and the application. The data access pattern of the application is collected and the files are then partitioned into file blocks and redistributed among all nodes via an adaptive distribution scheme. In the subsequent executions, the root node issues the read requests via a delayed file access mechanism to ensure that the file blocks are placed in the consumer nodes instead of the requester node, i.e., the root node. Whenc omputation is completed, write requests are issued by the root node via the delayed access mechanism so that the data on each node are stored in its local disk instead of the root node, thereby reducing the network traffic.
Adaptive Data Distribution Scheme
An adaptive data distribution scheme is used to reduce the amount of network transfers. In this scheme, file blocks are distributed among nodes according to the application's access pattern. It is divided into the metadata file and the information collection technique. A metadata file for each parallel file on each node is required to keep track of the global block numbers of the file blocks in the local file portion, as the file blocks are distributed on the nodes according to the access patterns, which vary across applications. Every file block stored in the disk of a node has an entry in the associated metadata file, and this entry records the global block number of that file block in the parallel file. When the parallel file is opened, a request is broadcasted and the metadata files are opened. When the file read is issued, a read request is broadcasted and each node reads the file blocks from its local disk into the specified shared memory locations according to the metadata file and the starting memory address given in the read function call. When a file is written,the owner node of each memory page of the result data writes that page into its local disk,and the metadata are generated by recording its block number. An information collection technique records the access pattern of a DSM application. In the pre-run execution, the whole input file of the application is placed at the root node. During initialization, the root node reads the file into its local memory pages and maps these pages to their corresponding global virtual addresses spaces. During computation, page faults will be generated when the file-related pages, i.e., the memory pages containing the file data, are accessed bythe other nodes except the root node. This page fault information is collected by our information collection module. By the end of the computation phase, each node has a list of the previously accessed file-related pages and this list is sent to the root node for parsing. In this way, the application's access pattern of the input file is ascertained and the file blocks are then redistributed accordingly. After redistributing the file blocks, a metadata file on each node is then created according to this information. In case of a file write, the owner nodeof a target page writes that page into the local disk, a nd each node creates the metadata file accordingly. With this technique, the amount of network traffic during computation is greatly reduced, except in the pre-run execution of the application.
Delayed File Access Mechanism
When a system employs ordinary file access mechanism, all file data are sent to the root node when a file is read no matter how we distribute the file blocks, because the requester node is always the root node in our system. The file data are then serialized by the network, negating the benefit of file I/O parallelization. Furthermore, the root node may use only a portion of the file data during computation. As a result, moving all file data to the root node for each file access induces unnecessary network traffic.
In the delayed file access mechanism, the requested file block is directly put into the specified memory page of the storag e node inst ead of the re-quester node. When the root node issues a file read operation, a file read request with the memory address specifed in the read operation is broadcasted . Every node then reads a file block from its local disk and stores the block in a local memory page. The metadata header is checked to find the virtual shared add ress space where this local memory page should be mapped, and the local memory page is then mapped to the corresponding virtual shared address space. The reading of the file blocks and the address mapping are repeated until all file blocks are processed. Each node then returns the file block numbers that it has handled to the root node, and the root node broadcasts the node locations of all memory pages containing the file blocks so that other nodes can locate these pages when page faults occur. In this way, the actual network transfer of file blocks is delayed until the data are actually used during computation. The amount of network messages in this mechanism may not be less than that in the previously mentioned ordinary mechanism, but the amount of file data transfer across network is lessened with the delayed mechanism.
Implementation in Cohesion system
Cohesion is a page-based software DSM system built on a network of intel ´86 based microcomputers running MSDOS with DOS-extender and iRMK Real -Time Kernel . The shared address space of a program running on Cohesion can be divided into two regions, which are categorized as the conventional memory and the eager release memory. A programmer can choose a region for his shared object by inheriting the corresponding base class provided by the system. In Cohesion, we have developed a kernel- level MSDOS-compatible file manipulation module which provides the basic file services such as open/close and read/write. There are four parallel file operations in this proto-type: POpen,PRead,PWrite, and PClose. POpen opens a parallel file for read or write access . An open notification is broadcasted by the root node. Each node then opens its local file portions and reads the metadata headersin case of a file read, or creates a local file in case of a file write. Via our delayed file access mechanism, PRead reads the content of the parallel file from the disks to the virtual shared address space and PWrite writes the contentof a virtual shared memory region to the disks. Experimentally, the total execution time of the 2048 * 1024 20-iterations Successive Over Relaxation program on eight nodes was reduced from 196 seconds with sequential I/O to 108 seconds with parallel I/O, while it was reduced in the 512 * 512 Matrix Multiplication program on eight nodes from 100 seconds with sequential I/O to 84 seconds with parallel I/O.
Cray X1
Designed at the Oak Ridge National Laboratory, the Cray X1 supercomputer is a DSM vector multiprocessor. According to Dunigan <ref name="dunigan">T. H. Dunigan, Jr., J. S. Vetter, J. B. White, III, and P. H. Worley, BPerformance evaluation of the Cray X1 distributed shared-memory architecture,[ IEEE Micro, vol. 25, pp. 30–40, Jan.-Feb. 2005. paper</ref>, "The X1's hierarchial design uses the basic building blocks of multi-streaming processor (MSP), which is capable of 12.8 GF/s for 64-bit operations." It is also scalable to 4096 processors and up to 65 terabytes of memory. This super computer was introduced in 2002 and shows significant performance improvements on several applications. The X1 system attempts to take older Cray supercomputers and improve upon their systems. One of the interesting aspects of this system is that it performs much like a NUMA system. The difference is that each SMP node is not cached.
Overview
The system is an attempt to incorporate all the previous Cray vector systems into one design. One aspect of the X1 is the high memory bandwith. This will help to realize the theoretical peak performance. The Cray X1 uses a Multi-Streaming Processor (MSP). The hierarchical design of the system uses these MSP as a building block for the system. Each processor is made up of four single-streaming processors. Below is a a graphical representation of an individual Cray MSP module.
What is so interesting about these modules is how they can be interconnected to create a larger Cray X1 Node shown below.
For this example, we can see "...four MSPs, 16 memory controller chips (M-chips), and 32 memory daughter cards form a Cray X1 node."<ref name="dunigan"></ref> The memory banks provide 204 GB/s of bandwith which is enough for the processors and streams. Also, these banks use ECC memory which has built in correcting of single-bit errors and detection of multiple-bit errors as well as providing chip-kill error detection. These modules are connected using X1 routing modules. Each node uses full duplex links and each memory module has an even and odd 64-bit link. The overall topology is a 4-D hypercube, though larger configurations move towards a 2-D torus. Since this is a hardware supported DSM, there is little more to discuss regarding the system, so we will move forward into evaluating the performance.
Performance Improvements
Interconnect Latency
The first area of performance improvements that this architecture realizes is in reducing the impact of the latency of interconnections. In a traditional DSM system, remote memory accesses would be passed over an interconnect and would suffer the asymmetric latency of NUMA. In the X1, programs can configure and dictate how nodes are used <ref name="cray x11 optimization"> "Optimizing Applications on the Cray X1TM System", webpage</ref>
In one configuration, an application executes within a node. The result is that it executes as if it were on an SMP system with all accesses to memory being local and thus being uniform latency, avoiding the latency of the interconnections to other nodes <ref name="cray x11 optimization"></ref>. This configuration allows the application programmer to use APIs such as OpenMP and utilize shared memory programming models within a node. Since this configuration limits the resources to four MSPs, the programs must be capable of achieving desired performance in four MSPs or, in the context of programs the require more than one node (using message passing with MPI across nodes), must have non-overlapping of memory accesses between nodes. Most scientific applications exhibit little data sharing, but those that do require data sharing see a performance improvement from running within a node and sharing the memory at the local node level through a simple node configuration. <ref name="cray x11 optimization"></ref>
In another configuration, an application execution spans multiple nodes. Through this configuration, all address translations for remote nodes are handled through the TLBs emulating RTTs, but sacrifices some of the performance and gains extra flexibility over the execution within a node. <ref name="cray x11 optimization"></ref>
Understanding the X1 system, we see that the X1 node supports a cache-coherent shared memory and the Cray supports OpenMP, System V shared memory, and POSIX threads shared memory programming models <ref name="dunigan"></ref>. The Cray system also supports several DSM models. Donigan provides many benchmark graphs to show that the performance of the X1 is much greater than previous systems. Please refer to the paper for these graphs <ref name="dunigan"></ref>
Cache Coherence
Cache coherence is improved on by enabling a larger amount of cache hits and reducing cache misses for the types of applications that would tend to be executed on the Cray X1.
One method of improving results based on cache coherence is through vectorizing calculations to minimize latency. Since most scientific applications will involve floating point operations on data that don't have many dependences, then these applications can have their performance improved by the 32 element vector processors within the SSPs. In order to achieve high performance for vectorized programs, it is important to keep the vector pipeline full <ref name="cray x11 optimization"></ref>.
To aid in this, the Cray Fortran and C compilers will automatically vectorize loops as much as possible if they can determine that there are no data dependences and the calculations fit within an MSP. If the compiler produces suboptimal results because it is using static code analysis, then it is possible, through algorithm analysis, for the application programmer to utilize compiler directives to force a different optimization (including preferring vectorization for one loop over another) to improve performance from vectorization and maximizing cache hits. <ref name="cray x11 optimization"></ref>. More detail on these techniques can be found in the <ref name="cray x11 optimization"></ref>.
Performance
Climate Modeling
The Parallel Ocean Program is an ocean modeling code developed at Los Alamos National Laboratory that is used as the ocean component in the Community System Climate Model coupled climate model. <ref name="dunigan"></ref>
Fusion Simulation
GYRO is an Eulerian gyrokinetic-Maxwell solver developed by R.E. Waltz and J. Candy at General Atomics. It is used to study plasma microturbulence in fusion research.<ref name="dunigan"></ref>
Glossary
Distributed Shared Memory: A class of software and hardware implementations in which each node of a cluster has access to shared memory in addition to each node's non-shared private memory.
Non-Uniform Memory Access:
A multiprocessing design where the memory access time depends on the memory
location relative to a processor(local memory is accessed faster).
Granularity:
In parallel computing, the amount of computation in relation to communication,
i.e., the ratio of computation to the amount of communication.
Metadata File:
A file containing the data about each parallel file on each node.
Topology:
The layout pattern of interconnections of the various elements such as links,
nodes, etc. of a network.
Quiz
Question 1 : DSM are used for multi-processor systems that have more than _____ processors.
a) 4 b) 8 c) 16 d) 32
Question 2 : DSM has what two problems that causes implementation slowdowns?
a) Memory consistency and hardware support b) Hardware support and cache coherence c) Protocols and cache coherence d) None of the above
Question 3 : The Cray X1 is a _____ support for DSM.
a) Software b) Hardware c) Cache Coherence d) Memory Consistency
Question 4 : What of the following is NOT true of Shasta?
a) Exploitation of relaxed memory models b) Maximization of protocol messages c) Detection of migratory sharing patterns d) Prefetching directives
Question 5 : Parallel file I/O cohesion uses a ___ .
a) Delayed file access b) Instant file access c) No file access d) Virtual file access
Question 6 : The Cray X1 uses a ___ .
a) Module with six processing threads b) Module with a single processing thread c) Module with four processing thread d) Module with eight processing thread
Question 7 : Hardware support is less researched due to ___ .
a) Cost of hardware b) Lack of software support for new hardware c) Lack of new hardware ideas d) Compatibility issues
Question 8 : The first DSM was designed in ___ .
a) 1975 b) 1983 c) 1986 d) 1990
Question 9 : In a DSM handling messages is ___ .
a) Relatively low cost b) Has not impact on performance c) Relatively high cost d) No cost as it is hardware supported
Question 10 : Shasta was developed in ___ .
a) 1986 b) 1996 c) 2006 d) 1976
References
<references />
See Also
1. More on DSM design issues, the following is a graph and the reference points to the paper it came from.Image.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=84877&isnumber=2778
2. http://ieeexplore.ieee.org.prox.lib.ncsu.edu/stamp/stamp.jsp?arnumber=301925