CSC/ECE 506 Spring 2012/11a az

11a. Performance of DSM systems. Distributed shared memory (DSM) systems combine the programming model of shared memory systems with the scalability of distributed systems. However, because DSM systems need extra coordination between the software layer and the underlying hardware, achieving good performance can be a challenge. The factors that harm performance include the overhead of maintaining cache coherence, the overhead of maintaining memory consistency, and the latency of interconnections. This article explores the factors that affect the performance of DSM systems and the improvements that have been made to existing systems.

= Introduction =

Parallel processor system architectures can be implemented as distributed shared memory systems, in which clusters of processors or single processors are connected in a larger network of nodes, where each node has its own physical memory but the physical memories are combined into one shared memory that each node can address.
This article discusses the requirements of three basic components of DSM parallel architecture design: cache coherence, memory consistency, and interconnects. Each is first described and then examined for performance concerns. Finally, case studies of a hardware implementation (the Cray X1) and a software implementation (the Shasta DSM protocol) illustrate specific performance improvements that have been made.


== Cache coherence ==
DSM systems must maintain cache coherence just as is required by bus-based multiprocessor systems. Cache coherence problems arise when it is undefined how a change of a value in a specific processor's cache is propagated to the other caches [1, p. 183]. If multiple processors access and modify a shared location in memory and produce outputs based on that shared variable, it is possible to calculate incorrect values if cache coherence is not maintained.


Ensuring that a value changed in one cache is sent to another cache is called <i>write propagation</i> [1, p. 183]. Write propagation is one of the requirements that must be addressed to provide cache coherence. Without write propagation, one processor could modify a cached value and not notify the other processors that have the same value cached. The other caches would believe they have the latest data, so on subsequent reads they would supply the stale value to their respective processors, leading to incoherent results.
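As a simple illustration (a hypothetical two-processor interleaving, not taken from [1]), suppose x is initialized to 0 and both P0 and P1 have x cached:

<pre>
Initially x = 0, and both P0 and P1 have x cached.

P0:                                     P1:
s1: x = 5;   (P0 updates its cache)
                                        s2: r1 = x;  (P1 reads its own cache)

Without write propagation, P1's cached copy is never updated, so r1 = 0
even though the read occurs after P0's write.
</pre>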


Another requirement for cache coherence is <i>write serialization</i>, which Solihin [1, p. 183] defines as a requirement that "multiple changes to a single memory location are seen in the same order by all processors". If two processors perform writes to a single memory location in a certain order, then all other processors in the system should see the writes (by reading that memory location and subsequently caching the values) in the order in which they were written. If other processors observe the writes by reading the variable but see them in different orders, incoherent copies of the same variable can end up in multiple caches while each processor thinks it has the latest copy.
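The following hypothetical interleaving (not from [1]) shows a violation of write serialization for a location x that is initially 0:

<pre>
P0 writes x = 1 and then x = 2.

P1 observes the writes in order:        P2 observes the writes out of order:
r1 = x;  (sees 1)                       r3 = x;  (sees 2)
r2 = x;  (sees 2)                       r4 = x;  (sees 1)

P1 concludes the final value of x is 2, while P2 concludes it is 1, so the
two caches end up permanently incoherent.
</pre>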


Thus, for correctness, the cache coherence implementation must provide both write propagation and write serialization.
In order to maintain cache coherence, a cache coherence protocol is implemented in hardware (or, in specific cases, in software). In DSM systems, the cache coherence controller in a node interfaces with the processor and its cache, but also has a communication link to the other nodes through an interconnect network via a <i>communication assist</i>. It receives and acts upon requests from the local processor, and also sends, receives, and acts on requests to and from other nodes as <i>network transactions</i> through the communication assist.


Unlike in bus-based multiprocessor systems, the coherence controllers are not connected by a shared medium that serializes communication, nor do they share bus signal lines such as the SHARED line (which is asserted in a bus-based system when another processor has a copy of the cache block being addressed). In bus-based systems, the bus is also the medium over which invalidations or updates are sent to other coherence controllers, depending on the coherence protocol. Further, bus-based systems allow snooping of requests from other coherence controllers, such as read, read-exclusive, and flush requests. Since no bus exists but invalidations or updates still have to be sent to other coherence controllers, these are sent as network transactions. Additionally, since no bus exists, it is not guaranteed that a request will be seen by other processors once it is sent, so acknowledgment messages are also sent as network transactions in response to requests.


DSM-based systems do not replicate the broadcasting of messages to other coherence controllers as bus-based systems do, because the bandwidth requirements would be prohibitively large. Instead, many DSM systems utilize a construct called a <i>directory</i> that stores information about which cache block is cached in which state by the different nodes, avoiding the need to broadcast the invalidations, updates, upgrades, interventions, flushes, or other messages that coherence controllers send on buses. The directory enables a node to intelligently select a subset of nodes as message recipients, thereby reducing network traffic.
To ensure that a directory is properly updated and reflects the true state of the caches, the directory has its own coherence protocol, which responds to read, read-exclusive (write), and upgrade requests from the different nodes, and sends messages including invalidations to sharer nodes, replies with and without data, and interventions to nodes. Solihin [1, p. 332-352] covers directory-based coherence protocols in further detail.

Since each memory block address maps to a specific node (this mapping is generally determined at boot time), a new term, <i>home node</i>, is introduced to signify the node which houses a specific block in memory. One naive implementation of directories is to have a centralized directory that exists at one node, but this suffers from performance problems since that node becomes a bottleneck for all transactions, so a logical improvement is to use decentralized directories, where each home node maintains a directory for its own blocks. Since the mapping for a block is fixed, any node knows immediately which node is the home node and only needs to send requests to that node directly. Solihin [1, p. 325-327] covers home nodes and directory placement in further detail.
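A minimal sketch of what a directory entry and the home-node mapping might look like is shown below. This is an illustrative data structure only; the field names, the full-bit-vector sharer representation, and the simple block-interleaved home mapping are assumptions rather than the format of any particular machine.

<pre>
#include <stdint.h>

#define NUM_NODES 64            /* assumed system size */

/* Coherence state of a memory block as recorded at its home node. */
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

/* One directory entry per memory block, kept at the block's home node. */
typedef struct {
    dir_state_t state;          /* uncached, shared, or exclusively owned   */
    uint64_t    sharers;        /* full bit vector: bit i set if node i     */
                                /* holds a shared copy                      */
    uint16_t    owner;          /* owning node when state == DIR_EXCLUSIVE  */
} dir_entry_t;

/* Home node of a block: a simple fixed mapping (e.g., block-address
 * interleaving) decided at boot time, so every node can compute it locally. */
static inline unsigned home_node(uint64_t block_addr) {
    return (unsigned)(block_addr % NUM_NODES);
}
</pre>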
== Memory consistency ==


DSM systems must also maintain memory consistency, just as bus-based systems must. Memory consistency is concerned with how all processors in the system see the ordering of loads and stores [1, p. 283]. If one or more processors see the loads and stores as occurring in different orders, it is possible that each develops an inconsistent view of memory. Systems maintain memory consistency by implementing memory consistency models, which are discussed in detail below.
 
The need for memory consistency can be explained through an example concerning programmers' intuition [1]. Suppose there is a program with three threads and a cascading series of synchronization flags, as shown below, inspired by an example from [1, p. 285].
 
All variables are initialized to 0.
<pre>
P0:                                      P1:                                      P2:
s1: a = 1;                               s3: while(!a_rdy) {}                     s6: while(!ab_rdy) {}
s2: a_rdy = 1;                           s4: b = a + 1;                           s7: c = a * b;
                                         s5: ab_rdy = 1;
 
</pre>
 
On the surface, it would be expected that P0 sets a value for a and then sets a_rdy. P1 spins until a_rdy is set, at which point it assigns b = a + 1 = 2 and sets a flag indicating a and b are ready. P2 spins on ab_rdy until P1 sets it, when P2 finally sets c = a * b = 1 * 2 = 2.
 
Looking deeper, it is clear that this is the expected behavior only because it is expected that s1 precedes s2 (s1->s2), that s3->s4->s5, and that s6->s7. Expecting that the program executes this way is called the expectation of <i>program order</i>, which Solihin explains as "memory accesses are executed in a thread to follow the order in which they occur in the source code" [1, p. 285].
 
The other expectation is that the memory accesses are <i>atomic</i>, in that the memory accesses (namely writes) occur without overlapping with other memory accesses. The example above also illustrates non-atomic (overlapping) memory accesses. Consider that P0 sends "a" to P1 and P2, and the value arrives at P1 immediately but, due to an interconnection delay, takes longer to propagate to P2. It is possible that P2 executes s7 and updates the value of c without having received the new value of "a" from P0, thereby setting c = a * b = 0 * 2 = 0, which is not the expected result. [1]
 
Together, the expectations of write atomicity and program order define the <i>sequential consistency</i> memory model.
 
Relaxed memory consistency models also exist, but all serve to give the programmer an understanding of the memory consistency provided by the system, so that programs produce the expected output. More information on memory consistency models can be found in Solihin [1].


== Interconnections ==


DSM systems consist of nodes that are connected via interconnects. The collection of all interconnects combined is analogous to the bus in a bus-based multiprocessor system, but in DSM systems these interconnects are individual, discrete connections.
 
Interconnects are the media through which cache coherence messages such as invalidations, upgrades, and cache-to-cache transfers must be sent. Unlike in bus-based systems, where all processors sit on the same bus and generally observe bus events at the same time and in the same order, interconnects are point-to-point connections, so a message sent from one node to two others will likely not arrive at both at the same time, nor necessarily in the order sent.
 
Since the interconnects do not preserve serialization, protocols must be developed that dictate which messages must be sent. In a bus-based system, it can be taken for granted that a message placed on the bus will be received; with interconnects it cannot, so the protocols must include acknowledgement messages sent in response to initiated messages. Because messages are sent as discrete packets, they incur latency in being created; and because they are routed point to point, and are sometimes relayed through nodes as a series of responses and exchanges, extra latency can be incurred for a transaction that would have been much faster on a bus-based system.
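A sketch of the kinds of network transactions such a protocol exchanges is shown below. The message names and struct layout are illustrative assumptions, not the message format of any particular DSM machine; the point is that every request needs an explicit reply or acknowledgement, because no shared bus exists to make it visible to all nodes.

<pre>
#include <stdint.h>

/* Illustrative coherence message types carried as network transactions. */
typedef enum {
    MSG_READ_REQ,        /* request a shared copy of a block              */
    MSG_READ_EX_REQ,     /* request an exclusive (writable) copy          */
    MSG_UPGRADE_REQ,     /* shared -> modified without data transfer      */
    MSG_INVALIDATE,      /* tell a sharer to drop its copy                */
    MSG_INV_ACK,         /* sharer acknowledges the invalidation          */
    MSG_DATA_REPLY,      /* reply carrying the block's data               */
    MSG_ACK              /* reply carrying no data                        */
} msg_type_t;

/* One point-to-point network transaction. */
typedef struct {
    msg_type_t type;
    uint16_t   src_node;     /* sender                                    */
    uint16_t   dst_node;     /* receiver                                  */
    uint64_t   block_addr;   /* block the transaction refers to           */
} coherence_msg_t;
</pre>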
 
Special care must therefore be taken to understand this latency and to account for it in the design of the coherence protocol and consistency model.


= Performance Concerns =
A DSM machine has unique requirements compared to shared memory / bus-based machines in order to provide cache coherence and memory consistency, as well as having interconnects. Performance concerns for each of these concepts in DSM machines are discussed in detail below.

== Maintaining cache coherence ==

To maintain correct cache coherence, write propagation and write serialization must be provided, both of which can have adverse effects on performance.

Write serialization requires that all writes to a memory location be seen in the same order by other processors. Earlier, an example was given indicating how write serialization can be violated by observing writes out of order. A naive implementation of write serialization would require that a request and all its messages are performed atomically to avoid overlapping of requests [1, p. 338]. Solihin [1, p. 342-344] discusses correctness issues that can occur if requests are allowed to overlap without special consideration. A non-overlapping approach would require that each request has defined conditions indicating when it begins and when it ends, so that the home node can observe and wait for completion prior to processing other requests to the same block.

The performance concern of disallowing overlapping of requests is that subsequent read or write operations to the same block are delayed from initiating, even if some of the messages within the requests could be overlapped without correctness concerns.

Another performance problem that can arise through cache coherence is false sharing misses. False sharing can be explained by an example. Suppose two processors have a cache block cached in the shared state, but processor A is reading and writing a variable x within this block, and processor B is reading and writing a variable y within the same block. Although neither processor accesses the other's variable, since both variables map to the same cache block, each processor either invalidates or sends updates to the other without the other actually needing the data. In a DSM system, these invalidations or updates can unnecessarily consume a significant amount of interconnect bandwidth.
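The sketch below illustrates the false-sharing scenario just described. It is a hypothetical example (the struct layout, thread functions, and iteration count are assumptions): x and y fall in the same cache block, so the two threads invalidate each other's copies on every write even though they never touch the same variable. Padding each variable out to its own block would remove the false sharing.

<pre>
#include <pthread.h>

/* x and y are adjacent, so they almost certainly share one cache block. */
struct shared_data {
    int x;   /* written only by thread A */
    int y;   /* written only by thread B */
} data;

void *thread_a(void *arg) {
    for (int i = 0; i < 1000000; i++)
        data.x++;   /* each write invalidates B's cached copy of the block */
    return 0;
}

void *thread_b(void *arg) {
    for (int i = 0; i < 1000000; i++)
        data.y++;   /* each write invalidates A's cached copy of the block */
    return 0;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, 0, thread_a, 0);
    pthread_create(&b, 0, thread_b, 0);
    pthread_join(a, 0);
    pthread_join(b, 0);
    return 0;
}
</pre>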
== Maintaining memory consistency ==


Depending on the memory consistency model being enforced, performance can be lost by having to ensure various degrees of atomicity of memory accesses and program ordering.
 
For sequential consistency, following program order requires that each thread execute its statements in the order defined by the source code. The implication is that statements within a thread cannot be executed out of order, so compiler and processor optimizations that rely on out-of-order execution to reduce the latency of individual instructions and increase instruction-level parallelism on pipelined architectures must be avoided to varying degrees. [1, p. 293]
 
For atomicity of memory accesses to be seen by all processors, special considerations need to be made for DSM systems. In general, all processors must be able to detect when a load or a store has completed. For a store (write atomicity) on a bus-based system, completion can be assumed to occur as soon as the read-exclusive request reaches the bus. This is because all processors on the bus are guaranteed to see the request and will immediately invalidate their copies, so the next read to the location becomes a read miss, requiring the block to be re-cached with the most up-to-date value. On a DSM system, however, there is no bus, so a write cannot be assumed complete as soon as invalidations are sent on the network. Rather, the writer must wait until acknowledgements of the invalidations are received, which can take many network hops and incur high latency, especially if there is network congestion. [1, p. 292]
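A minimal sketch of how a node might detect write completion on a DSM system follows; it simply counts invalidation acknowledgements against the number of sharers recorded in the directory. The function and field names are assumptions made for illustration.

<pre>
#include <stdint.h>
#include <stdbool.h>

/* Book-keeping for one outstanding write (read-exclusive or upgrade) request. */
typedef struct {
    uint64_t block_addr;
    int      invalidations_sent;   /* number of sharers that were sent INV  */
    int      acks_received;        /* INV_ACK messages seen so far          */
} pending_write_t;

/* Called whenever an invalidation acknowledgement arrives for this write.
 * The write may only be considered complete (and thus appear atomic to all
 * processors) once every sharer has acknowledged its invalidation. */
bool on_inv_ack(pending_write_t *w) {
    w->acks_received++;
    return w->acks_received == w->invalidations_sent;  /* true => write complete */
}
</pre>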
 
Overall, memory consistency on DSM systems can add latency from waiting for acknowledgements that signal the completion of writes, and can lose performance by forcing in-order execution.
 
Relaxed memory consistency models are techniques normally used to alleviate performance concerns, and are discussed in detail as improvements in specific DSM systems.


== Latency of interconnections ==


As mentioned in the cache coherence and memory consistency sections, interconnections uniquely distinguish DSM systems from bus-based systems. Interconnections are unlike a bus in that they do not guarantee that messages reach all recipients, and messages are certainly not seen by all receivers at the same moment. Each message must be sent as a transaction in a networking protocol, and each packet incurs at least the latency of its hops through routers in the network, and can also incur latency in being generated.
 
Since message traffic can grow excessively if a DSM system is naively designed to behave identically to a bus-based system, care must be taken to design coherence protocols and consistency models that minimize the sending and receiving of messages and, where messages must be sent or received, allow execution to overlap with communication rather than blocking while waiting for messages or their receipt.


= Performance Improvements =


Architectures and programming models have been created that allow DSM machines to alleviate some of the performance issues inherent in DSM versus bus-based systems: the overheads of maintaining cache coherence and memory consistency, and the latency of interconnects.


Two examples are used to illustrate performance improvements. The first is the Cray X1 system, which is examined for its ability to alleviate the impact of interconnect latency and to reduce coherence misses [2]. The second is Shasta, a software-based DSM cache coherence protocol that improves cache coherence performance by reducing false sharing through variable block sizes, reduces latency by reducing interconnect traffic, and reduces latency further by supporting upgrade requests [5].


== Cray X1 Supercomputer ==
The Cray X1 is examined for its hardware architecture and software interfaces that allow cache coherence performance and interconnect performance to be improved beyond simple DSM machines.

The Cray X1 architecture consists of nodes, each containing 4 MSPs (multi-streaming processors) that in turn each contain 4 SSPs (single-streaming processors), 4 Ecaches totaling 2 MB, and a local shared memory of either 16 or 32 GB [3, Ch. 7]. Each SSP is made up of 2 vector processors and one superscalar processor. The nodes are connected together with X1 routing modules in a modified 2D torus configuration for larger configurations, or a 4-D hypercube for smaller configurations of up to 128 nodes [2].

The X1 is interesting in the way it approaches the problems of performance in that it is a hybrid SMP/DSM system. Thus, at the level of a node, it is possible to gain the performance of an SMP system and not have to suffer non-uniform memory access times for remote accesses [2].

Pages of memory have their virtual-to-physical memory mappings stored in a TLB (translation lookaside buffer) for local memory accesses within a node. As in most page table schemes on other architectures, TLBs are needed because processes are given virtual address spaces which may span many physical memory locations (or page tables). If an access is made to a virtual address, the TLB provides a quick way to determine exactly where the physical memory resides, rather than resorting to walking a page table [3, Ch. 7; 4]. TLBs act like a cache for virtual-to-physical address translations, reducing the latency of accessing memory on SSP Ecache misses.

The X1 also provides RTTs (remote translation tables), which provide the same functionality as a TLB but for non-local (off-node) memory references.

=== Improving on interconnect latency ===
The first area of performance improvements that this architecture realizes is in reducing the impact of the latency of interconnections. In a traditional DSM system, remote memory accesses would be passed over an interconnect and would suffer the asymmetric latency of NUMA. In the X1, programs can configure and dictate how nodes are used [3, CH 1].  


In one configuration, an application executes within a single node. The result is that it executes as if it were on an SMP system, with all memory accesses being local and thus of uniform latency, avoiding the latency of the interconnections to other nodes [3, Ch. 1]. This configuration allows the application programmer to use APIs such as OpenMP and to utilize shared memory programming models within a node. Since this configuration limits the resources to four MSPs, the program must be capable of achieving the desired performance within four MSPs or, for programs that require more than one node (using message passing with MPI across nodes), must avoid overlapping memory accesses between nodes. Most scientific applications exhibit little data sharing, but those that do require data sharing see a performance improvement from running within a node and sharing memory at the local node level through this simple node configuration. [3]


In another configuration, an application's execution spans multiple nodes. In this configuration, all address translations for remote nodes are handled through the TLBs emulating RTTs; this sacrifices some of the performance of the within-a-node configuration but gains extra flexibility. [3]
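The sketch below shows the kind of hybrid program structure these two configurations support: OpenMP shared memory parallelism among the processors within a node, with MPI message passing between nodes. It is a generic hybrid MPI+OpenMP skeleton, not Cray-specific code, and the work-distribution details are placeholders.

<pre>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* one MPI rank per node */

    double local_sum = 0.0, global_sum = 0.0;

    /* Shared-memory parallelism within the node (e.g., across MSPs). */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += (double)i / (rank + 1);   /* placeholder work */

    /* Message passing between nodes: no shared memory accesses cross nodes. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
</pre>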


=== Improving on cache coherence ===
Cache coherence performance is improved by enabling more cache hits and reducing cache misses for the types of applications that tend to be executed on the Cray X1.


One method of improving cache behavior is to vectorize calculations to minimize latency. Since most scientific applications involve floating-point operations on data with few dependences, these applications can have their performance improved by the 32-element vector processors within the SSPs. In order to achieve high performance for vectorized programs, it is important to keep the vector pipeline full [3].

To aid in this, the Cray Fortran and C compilers automatically vectorize loops as much as possible when they can determine that there are no data dependences and that the calculations fit within an MSP. If the compiler produces suboptimal results because it relies on static code analysis, the application programmer can, through algorithm analysis, use compiler directives to force a different optimization (including preferring vectorization of one loop over another) to improve vectorization and maximize cache hits [3, Ch. 7]. More detail on these techniques can be found in [3, Ch. 7.4].
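The fragment below sketches the idea: a loop the compiler might not vectorize on its own because it cannot prove the pointers do not alias, with a directive asserting that vectorization is safe. The exact directive spelling varies by compiler (Cray, Intel, and others each have their own ivdep-style hints), so the pragma shown is illustrative rather than the precise Cray X1 syntax.

<pre>
/* daxpy-style loop: y[i] += a * x[i].
 * If the compiler cannot prove x and y do not overlap, it may refuse to
 * vectorize; an ivdep-style directive asserts the iterations are independent. */
void daxpy(int n, double a, const double *x, double *y) {
#pragma ivdep    /* illustrative vectorization hint; syntax is compiler-specific */
    for (int i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}
</pre>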


== Shasta distributed shared memory protocol ==
Shasta is a DSM protocol that allows for multiple levels of granularity for pages in lieu of standard page sizes. Shasta is a software implementation that creates a shared address space across machines with distributed memory; thus it represents a pure software approach to solving DSM performance issues [5].
The motivation behind Shasta was to provide an alternative to SVM (shared virtual memory) systems, which use hardware that produces a shared address space by sharing memory at a fixed page size and which maintain coherence via interconnects. Shasta is designed to be flexible, enables implementation of "cache coherence protocols to improve parallel performance" in software, and does all this with fewer coherence messages than common hardware DSM systems [5].


The cache coherence protocol is a directory based invalidation protocol. The directory entries contain pointers to the owner processor and full bit vectors for sharer processors.  The directories are decentralized and maintained at the different nodes (acting as home nodes). The shared address space is divided into blocks (of multiple sizes, as will be explored later), and each block itself is divided into fixed size lines, where each line has an entry in a state table. Coherence is maintained at the block level, but miss checks are simplified since they only need to look at the state table entry for that line directly [5].
 
Since Shasta is a software approach, a Shasta compiler inlines the instructions needed to adhere to the protocol, including those for miss checks prior to loads and stores.  


=== Improving on cache coherence and interconnect latency ===
Shasta improves cache coherence performance by reducing false sharing misses. To accomplish this, the protocol supports subdividing the shared address space into ranges, where different ranges use different block sizes. Dynamic memory allocations are done in the shared memory space, so the protocol can intelligently select the range whose block size best matches the allocation. Multiple levels of granularity enable this fine-grained sharing and reduce false sharing misses, thereby increasing coherence protocol performance without the complexity of dealing with false sharing directly [5].
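A minimal sketch of the allocation idea follows: the shared address space is split into ranges, each with its own block (coherence unit) size, and an allocation is served from the range whose block size best fits the requested object, rounded up so that distinct objects do not share a block. The range table, sizes, and function names are assumptions made for illustration; they are not Shasta's actual interface.

<pre>
#include <stddef.h>

/* One range of the shared address space, with its own coherence block size. */
typedef struct {
    size_t block_size;      /* coherence granularity used within this range */
    char  *next_free;       /* simple bump-pointer allocator for the sketch */
    char  *range_end;
} shared_range_t;

/* Hypothetical backing storage and ranges: small objects get fine-grained
 * blocks (less false sharing), large objects get coarse blocks. */
static char pool0[1 << 16], pool1[1 << 18], pool2[1 << 20];
static shared_range_t ranges[3] = {
    {   64, pool0, pool0 + sizeof pool0 },
    {  256, pool1, pool1 + sizeof pool1 },
    { 1024, pool2, pool2 + sizeof pool2 },
};

void *shared_malloc(size_t size) {
    /* Pick the first range whose block size is >= the request, falling back
     * to the coarsest range for large allocations. */
    int i = (size <= 64) ? 0 : (size <= 256) ? 1 : 2;
    shared_range_t *r = &ranges[i];

    /* Round the allocation up to a whole block so distinct objects never
     * share a coherence block. */
    size_t rounded = (size + r->block_size - 1) / r->block_size * r->block_size;
    if (r->next_free + rounded > r->range_end)
        return 0;                   /* out of space in this range */
    void *p = r->next_free;
    r->next_free += rounded;
    return p;
}
</pre>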


Shasta also improves on cache coherence performance by reducing the overhead of miss checks through various means.  


First, it was mentioned that dynamic memory allocations are shared; data allocated on the stack or in static storage is not shared. Because such data is accessed through the stack pointer or global pointer, the miss check code can verify whether the base register of a load or store is, or is calculated from, the stack pointer or global pointer; if so, the access is to non-shared data and the remainder of the miss check can be skipped. [5]
 
Second, the protocol places a special invalid flag value into each long word of a line when the line is invalidated. A miss check can then quickly compare the loaded value with the flag value: if they differ, the line cannot be invalid and the rest of the check is skipped. If they are equal, the miss check goes on to consult the state table entry for the line to determine whether it is legitimately a miss or whether the line contents just happened to be valid data that equals the invalidation flag (a "false miss"). As the Shasta authors note, false misses almost never occur, because the invalidation flag written into every long word of the line is an irregular pattern. These techniques lead to large reductions in miss check time [5].
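A sketch of the kind of inlined check that might precede a shared load is shown below. The flag value, table layout, and function names are illustrative assumptions; the real Shasta checks are inserted as inline machine instructions by its instrumentation, not as a C helper.

<pre>
#include <stdint.h>

#define MISS_FLAG 0xDEADBEEFDEADBEEFULL   /* assumed invalid-flag pattern */

typedef enum { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE } line_state_t;

extern line_state_t state_table[];            /* one entry per line          */
extern unsigned line_index(const void *addr); /* maps an address to its line */

/* Returns nonzero if the load of *addr misses and the protocol's miss
 * handler must be invoked before the value can be used. */
int load_miss_check(const uint64_t *addr) {
    uint64_t value = *addr;
    if (value != MISS_FLAG)
        return 0;     /* fast path: value is not the flag, so the line is valid */
    /* Slow path: value equals the flag, so consult the state table to
     * distinguish a real miss from a "false miss". */
    return state_table[line_index(addr)] == LINE_INVALID;
}
</pre>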
 
Third, miss checks for multiple loads and stores are grouped, or batched. The Shasta authors provide an example (paraphrased here) of consecutive loads and stores that are all relative to one base register and span a range no larger than a line; in that case, at most two consecutive lines are being accessed. If inline checks show that these lines are not invalid, then the loads and stores can proceed without further checks, reducing the latency of the miss checks. Special techniques exist for checking the two consecutive line states with inline checks using the base register and offsets. More detail on the implementation can be found in [5].
 
Another method of improving cache coherence is through the protocol requests. The protocol is a directory-based invalidation protocol that supports dirty sharing. The three types of requests that the protocol supports are read, read-exclusive (write), and upgrade. The upgrade request, known as "exclusive" in the protocol, enables a processor to convert a shared block into a modified block without fetching the data again, avoiding the latency of a full read-exclusive. [5]
 
The protocol requests are further improved by guaranteeing that a home node acts on every request it receives. Because of this guarantee, a request can be considered complete at the time it reaches the home node, rather than after acknowledgements are received from the different nodes, which reduces the latency of a coherence request [5].


=== Improving on memory consistency ===


Shasta improves memory consistency performance by providing mechanisms that support release consistency. Release consistency relaxes the requirements of sequential consistency so that only synchronization accesses must be ordered: accesses issued after an acquire may not begin until the acquire completes, and all loads and stores issued before a release must complete before the release completes, while accesses issued after the release need not wait for it.
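The pseudocode below, in the same style as the earlier example (and hypothetical, not taken from [5]), shows which orderings release consistency enforces around a critical section:

<pre>
P0:
s1: x = 1;            // ordinary write: may be reordered past the acquire below
s2: acquire(lock);    // s3 and s4 may not begin until the acquire completes
s3: shared = 42;
s4: y = shared + 1;
s5: release(lock);    // may not complete until s1, s3, and s4 have completed
s6: z = 2;            // ordinary write: need not wait for the release to complete
</pre>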
 
The protocol supports non-blocking loads and stores by adding transient states for lines: pending-invalid, entered while a read or read-exclusive request is outstanding, and pending-shared, entered while an upgrade is outstanding. Subsequent memory accesses to other lines can continue, while accesses to a line in a transient state must wait, so latency is reduced for the other accesses. [5]
 
The protocol also supports non-blocking releases, which delay a release but do not block subsequent accesses from occurring before the release is complete. This allows statements immediately after (or exiting) a critical section to execute with less latency. [5]
 
Shasta also supports what is known as lockup-free behavior for lines. For example, a write to a line that is pending is allowed to complete by placing the new value into memory and recording the location in a miss handler; this miss handler is consulted on subsequent accesses to that line. This keeps subsequent writes from having to wait for the first write to complete, reducing their latency. Another lockup-free feature is that reads can be serviced immediately if a line is in pending-shared, since the line is assumed to already hold the latest value [5]. More detail can be found in [5].


= Definitions =
<b>Under construction</b>

= References =


1) Y. Solihin, Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems, Solihin Publishing & Consulting, LLC, 2009.

2) IEEE Micro article on the Cray X1 (ieee-micro-cray-x1-v8.pdf).

3) Optimizing Applications on the Cray X1 system (http://docs.cray.com/books/S-2315-52/html-S-2315-52/S-2315-52-toc.html).

4) Wikipedia article on the Translation Lookaside Buffer (http://en.wikipedia.org/wiki/Translation_lookaside_buffer).

5) Performance of the Shasta Distributed Shared Memory Protocol (WRL-97-2.pdf).


= Quiz =


<pre>
1) Which of the following are necessary for maintaining correct cache coherence?
a) write invalidation & write update
b) write serialization & write propagation
c) write serialization & write-through
d) write allocation & write buffer

2) Which of the following will NOT help the Cray X1 supercomputer improve performance?
a) Reducing the impact of latency from interconnections
b) Vectorizing calculations to minimize latency
c) Reducing the overhead of miss checks
d) Enabling larger amounts of cache hits while reducing cache misses

3) For the Cray X1 supercomputer, which of the following has the same functionality as an RTT?
a) SMP
b) MSP
c) DSM
d) TLB

4) Which of the following is NOT true?
a) Shasta represents a pure software approach to solving DSM performance issues.
b) The motivation behind the Cray X1 was to create an SVM.
c) Shasta allows for multiple levels of granularity for pages in lieu of standard page sizes.
d) The SSPs of the Cray X1 are made up of 2 vector processors and one superscalar processor.

5) When improving the latency of a Cray X1 supercomputer, which of the following happens in the configuration where an application's execution spans multiple nodes?
a) All address translations for remote nodes are handled through the TLBs emulating RTTs
b) It allows the application programmer to use APIs such as OpenMP
c) It utilizes shared memory programming models within a node
d) It limits the resources to four MSPs

6) True or False: Shasta supports what is known as lockup-free behavior for lines.
a) True
b) False

7) True or False: Sequential consistency is a type of memory consistency that requires atomic memory accesses and enforces program order of memory accesses across all processors.
a) True
b) False

8) Distributed shared memory systems utilize interconnects between ____.
a) routers
b) processors
c) nodes
d) cache coherence controllers

9) A performance concern of maintaining correctness in cache coherence is ______.
a) write serialization
b) write-back
c) invalidation

10) True or False: Write propagation on DSM machines can always be treated as on bus-based systems, where a write is seen when an invalidation is sent to sharers and a read is complete when data is returned to the requestor.
a) True
b) False
</pre>
== ''Answer Key'' ==
1) b.  2) c.  3) d. 4) b.  5) a.  6) a.  7) b.  8) c.  9) a.  10) b.

Latest revision as of 03:28, 26 April 2012

11a.  Performance of DSM systems.  Distributed shared memory systems combine the programming models of shared memory systems, and the scalability of distributed systems.  However, since DSM systems need extra coordination between software layer and underlying hardware, achieving good performance could be a big challenge. The factors that harm the performance could be the overhead to maintain cache coherence, memory consistency, and the latency of interconnections. Please further explore the factors that can affect the performance of DSM systems, and the improvements that have been made on the existing systems.

Introduction

Under construction

THE REFERENCES (correct formatting and hyperlinks), DEFINITIONS, AND QUIZ ARE STILL UNDER CONSTRUCTION. PLEASE BE PATIENT AND BE SURE TO RE-REVIEW BY THE END OF THE WEEK.

Parallel processor system architectures can be implemented as distributed shared memory systems, in which clusters of processors or single processors are connected in a larger network of nodes, where each node has it's own physical memory, but the physical memories are combined into one shared memory which each node can address.

This article discusses the requirements of three basic components of DSM parallel architecture design: cache coherence, memory consistency, and interconnects. Each is first described and then examined for performance concerns. Finally, case studies are performed on a hardware (Cray X1) and a software implementation (Shasta DSM protocol) of DSM systems to discuss specific performance improvements that have been made.

Cache coherence

DSM systems must maintain cache coherence just as it required by bus-based multiprocessor systems. Cache coherence problems arise when it is undefined how a change of a value in a specific processor's cache is propagated to the other caches [1, p. 183]. If multiple processors access and modify a shared location in memory and produce outputs based on that shared variable, it is possible to calculate incorrect values if cache coherence is not maintained.

Ensuring that a value changed in one cache is sent to another cache is called write propagation. [1, p. 183] Write propagation is one of the requirements that must be addressed to be provide cache coherency. Without write propagation, one processor could modify a cached value and not notify the other processors that have the same value cached. The other caches may believe they have the latest data, thus on subsequent reads, their caches will provide it to their respective processors, leading to incoherent results.

Another requirement for cache coherence is write serialization, which Solihin [1 p. 183] defines as a requirement that "multiple changes to a single memory location are seen in the same order by all processors". If two processors perform writes to a single memory location in a certain order, then all other processors in the system should see the writes (by reading that memory location and subsequently caching the values) in the order in which they were written. If other processors observe the writes by reading the variable, but see the writes in different orders, this can lead to incoherent copies of the same variable in multiple caches while each think they have the latest copy.

Thus, for correctness purposes, it is required that write propagation and write serialization are provided by the cache coherence implementation.

In order to maintain cache coherence, a cache coherence protocol is implemented in hardware (or in specific cases, in software). In DSM systems, the cache coherence controller in a node interfaces with the processor and it's cache, but also has a communication link to the other nodes through an interconnect network via a communication assist. It receives and acts upon requests from the local processor, as well as sends and receives (and acts on) requests sent to/from other nodes as network transactions through the communication assist.

Unlike bus based multiprocessor systems, the coherence controllers are not connected with a medium that allows for (serialized) communication nor bus signal lines, such as the SHARED line (which is asserted in a bus based system when another processor has a copy of that cache block which is being addressed). In bus based systems, the bus is also the medium in which invalidations or updates are sent to other coherence controllers, depending on the coherence protocol. Further, bus based systems allow for snooping of requests from other coherence controllers such as read, read-exclusive, flushes, etc. Since no bus exists, but invalidations or updates have to be sent to other coherence controllers, these are sent as network transactions. Additionally, since no bus exists, it isn't guaranteed that a request will be seen by other processors once it is sent, so acknowledgment messages are also sent as network transactions in response to requests.

DSM based systems do not replicate the broadcasting of messages to other coherence controllers as bus based systems do because the bandwidth requirements would be prohibitively large. Many DSM systems utilize a construct called a directory that stores information about which cache block is cached in which state by the different nodes to avoid having to broadcast invalidations, updates, upgrades, interventions, flushes, or other messages sent by coherence controllers on buses. The directory enables a node to select a subset of nodes as message recipients intelligently, thereby reducing the network traffic.

To ensure that a directory is properly updated and reflects the true state of the caches, the directory has its own coherence protocol which responds to read, read exclusive (write), and upgrade requests from the different nodes, and sends messages including invalidations to sharer nodes, replies with and without data to nodes, and interventions to nodes. Solihin [1, p. 332-352] covers directory based coherence protocols in further detail.

Since each memory block address maps to a specific node (this mapping is generally determined at boot time), then a new term, home node, is introduced to signify the node which houses a specific block in memory. One naive implementation of directories is to have a centralized directory that exists at one node, but this suffers from performance problems since that node becomes a bottleneck for all transactions, so a logical improvement is to utilize decentralized directories, where each home node maintains a directory for its blocks. Since the mapping for a block is fixed, then any node knows immediately which is the home node, and only needs to send requests to that node directly. Solihin [1, p. 325-327] covers home nodes and directory placement in further detail.

Memory consistency

DSM systems must also maintain memory consistency just as bus based systems. Memory consistency is concerned with how all processors in the system see the ordering of loads and stores [1, p.283]. If one or more processors see the loads and stores as occurring in different orders, it is possible that each develops an inconsistent view of memory. Systems maintain memory consistency by implementing memory consistency models, which are discussed in detail below.

The need for memory consistency can be explained through an example concerning programmers intuition [1]. Suppose there is a program with three threads that has a cascading series of locks, as shown below, which is inspired by an example from [1, p. 285].

All variables are initialized to 0.

P0:                                      P1:                                      P2:
s1: a = 1;                               s3: while(!a_rdy) {}                     s6: while(!ab_rdy) {}
s2: a_rdy = 1;                           s4: b = a + 1;                           s7: c = a * b;
                                         s5: ab_rdy = 1;

On the surface, it would be expected that P0 sets a value to a and sets a_rdy. P1 spins until a_rady is set, at which point it assigns b = a + 1 = 2, and sets a flag indicating a and b are ready. P2 spins on ab_rdy until P1 sets it, when P2 finally sets c = a * b = 1 * 2 = 2.

Looking deeper, it is clear that this is only expected behavior because it is expected that s1 precedes s2 (s1->s2), and s3->s4->s5, and s6->s7. Expecting that the program execute as this is called the expectation of program order, which Solihin explains as "memory accesses are executed in a thread to follow the order in which they occur in the source code" [1, p. 285].

The other expectation is that the memory accesses are atomic, in that the memory accesses (namely writes) occur without overlapping with other memory accesses. An example of memory accesses being non-atomic (or overlapping) is from the example above. Consider that P0 sends "a" to P1 and P2 and the value arrives at P1 immediately, but due to an interconnection delay, takes longer time to propagate to P2. It is possible that P2 executes s7 and updates the value of c without having received the new value of "a" from P0, thereby setting c = a * b = 0 * 2 = 0, which is not the expected result. [1]

Combining the expectation of write atomicity and program order define the sequential consistency memory model.

Other relaxed memory consistency models also exist, but all serve to give the programmer an understanding of the memory consistency provided by the system to ensure that programs produce the expected output. More info on memory consistency models can be found in Solihin [1].

Interconnections

DSM systems consist of nodes that are connected together via interconnects. The collection of all interconnects combined is analogous to a bus on a bus-based multiprocessor system, but in the case of DSM systems, these interconnects are individual discrete connections.

Interconnects are the media through which cache coherence messages, such as invalidations, upgrades, cache to cache transfers, etc must be sent. Unlike in bus based systems where all processors exist on the same bus and will generally see things occur on the bus at the same time and in the same order, interconnects are point to point connections, so it is likely that a message sent from one node to two others will not arrive at the others at the same time nor in order.

Since the interconnects do not preserve serialization, protocols must be developed that dictate that messages must be sent, but where in a bus based system it could be taken for granted that if a message is put on the bus that it will be received, it can't be taken for granted in interconnects, thus the protocols must include a series of acknowledgement messages as responses to initiated messages. Since messages must be sent as discrete packets, these messages incur latency in being created, and since they must route from point to point, and sometimes are relayed through nodes as a series of responses / exchanges, extra latency can be incurred for a transaction that would've been much faster on a bus based system.

Special care must be taken to understand the latency and design it through protocol implementations.

Performance Concerns

A DSM machine has unique requirements compared to shared memory / bus based machines in order to provide cache coherence and memory consistency, as well as having interconnects. Performance concerns of each of these concepts in DSM machines are discussed in detail.

Maintaining cache coherence

To maintain correct cache coherence, write propagation and write serialization must be provided, both of which can have adverse effects on performance.

Write serialization requires that all writes to a memory location be seen in the same order by other processors. Earlier, an example was given indicating how write serialization can be violated by observing writes out of order. A naive implementation of write serialization would require that a request and all it's messages are performed atomically to avoid overlapping of requests [1, p. 338]. Solihin [1, p. 342-344] discusses correctness issues that can occur if requests are allowed to overlap without special consideration. A non-overlapping approach would require that each request has conditions defined that indicate when it was begun and when it ends, in order for the home node to observe and wait for completion prior to processing other requests to the same block.

The performance concern of disallowing overlapping of requests is that subsequent read or write operations to the same block would be delayed from initiating, even if some of the messages within the requests can be overlapped without correctness concerns.

Another performance problem that can arise through cache coherence are false sharing misses. False sharing can be explained by an example. Suppose two processors have a cache block caches in the shared state, but processor A is reading and writing to a variable x within this block, and processor B is reading and writing to a variable y within this block. Although both processors are not attempting to access each others variables, since both variables map to the same cache block, then each processor either invalidates or sends updates to the other without the other actually needing the data. in a DSM system, these invalidations or updates can unnecessarily utilize a significant amount of interconnect bandwidth.

Maintaining memory consistency

Depending on the memory consistency model that is being enforced, performance can be lost be having to ensure various degrees of atomicity of memory accesses and program ordering.

For sequential consistency, to follow program order requires that the program executes statements in the order defined by the source code for a thread. The implication is that statements within a thread cannot be executed out of order, so compiler optimizations and processor optimizations that include out of order execution as an attempt to reduce the latency of individual instructions and increase instruction level parallelism on pipelined architectures must be avoided to varying degrees. [1, p. 293]

For atomicity of memory accesses to be seen by all processors, special considerations need to be made for DSM systems. In general, all processors must be able to detect when a load or a store has completed. For a store, or write atomicity, on a bus based system, the completion can be assumed as occurring as soon as a read exclusive request reach the bus. This is because all processors on the bus are guaranteed to see the request and will immediately invalidate themselves, resulting in the next read to the location becoming a read miss, requiring the block to be re-cached with the most up to date value. On a DSM system, however, there is no bus, so a write cannot be assumed complete as soon as invalidations are sent on the network. Rather, they must wait until acknowledgements of the invalidations are received which can take many hops and incur high latency, especially if there is network congestion. [1, p.292]

Overall, memory consistency on DSM systems can require latencies while waiting for acknowledgements to see completion of writes and loss of performance from in-order execution.

Relaxed memory consistency models are techniques normally used to alleviate performance concerns, and are discussed in detail as improvements in specific DSM systems.

Latency of interconnections

A mentioned in the cache coherency and memory consistency sections, interconnections unique distinguish DSM systems from bus based systems. Interconnections are unlike a bus in that they do not guarantee that messages reach recipients, and certainly aren't seen by the receivers at the same moment. Each message must be sent as a transaction in a networking protocol, and each packet sent has at least the latency of hops through routers in the network, and can also incur latency in being generated.

Since messages can become ubiquitous if the DSM system was naively designed to perform identically to a bus based system, care must be taken to design coherence protocols and consistency models that minimize the sending and receiving of messages, and if they must be sent or received, allow for overlapping of execution without blocking while waiting for messages or their receipt.

Performance Improvements

Architectures and programming models have been created that allow DSM machines to alleviate some of the performance issues inherent in DSM vs. bus based systems while also avoiding the overheads of maintaining cache coherence and memory consistency, and the latency of interconnects.

Two examples are used to illustrate performance improvements. The first example is the Cray X1 system which is examined for it's ability to alleviate the impact of the interconnect latency and reduce coherence misses [2]. The second example is the Shasta, a software based DSM cache coherence protocol that improves cache coherence performance by reducing false sharing through variable page sizes, reduces latency by reducing interconnect traffic, and reduces latency by supporting upgrades [5].

Cray X1 Supercomputer

The Crazy X1 is examined for it's hardware architecture and software interfaces that allow for cache coherence performance and interconnect performance to be improved beyond simple DSM machines.

The Cray X1 architecture consists of nodes consisting of 4 x MSPs (multi-streaming processors) that each consist of 4 x SSPs (single-streaming processors), 4 x Ecaches totaling 2 MB, and a local shared memory of either 16 or 32 GB [3, CH 7]. The SSPs are made up of 2 vector processors and one superscalar processor. The nodes are connected together with X1 routing modules in a modified 2D torus configuration for larger configurations, or a 4-D hypercube for smaller configurations up to 128 nodes [2].

The X1 is interesting in the way it approaches the problems of performance in that it is a hybrid SMP/DSM system. Thus, on the level of a node, it is possible to gain the performance of a SMP system and not have to suffer with non-uniform memory access times for remote accesses [2].

Pages of memory have their virtual to physical memory mappings stored in a TLB (translation lookaside buffer) for local memory accesses within a node. Like in most page table schemes in other architectures, TLBs are needed because processes are given virtual address spaces which may span many physical memory locations (or page tables). If an access is made to a virtual address, the TLB provides a quick way to determine exactly where the physical memory resides, rather than resorting to walking a page table to determine a hit [3 - CH 7, 4]. TLBs act like a cache for virtual to physical address translations, reducing the latency in accessing memory for SPP Ecache misses.

The X1 also provides RTTs (remote translation tables), which provide the same functionality as a TLB but for non-local (off-node) memory references.

Improving on interconnect latency

The first area of performance improvements that this architecture realizes is in reducing the impact of the latency of interconnections. In a traditional DSM system, remote memory accesses would be passed over an interconnect and would suffer the asymmetric latency of NUMA. In the X1, programs can configure and dictate how nodes are used [3, CH 1].

In one configuration, an application executes within a node. The result is that it executes as if it were on an SMP system with all accesses to memory being local and thus being uniform latency, avoiding the latency of the interconnections to other nodes [3, CH 1]. This configuration allows the application programmer to use APIs such as OpenMP and utilize shared memory programming models within a node. Since this configuration limits the resources to four MSPs, the programs must be capable of achieving desired performance in four MSPs or, in the context of programs the require more than one node (using message passing with MPI across nodes), must have non-overlapping of memory accesses between nodes. Most scientific applications exhibit little data sharing, but those that do require data sharing see a performance improvement from running within a node and sharing the memory at the local node level through a simple node configuration. [3]

In another configuration, an application execution spans multiple nodes. Through this configuration, all address translations for remote nodes are handled through the TLBs emulating RTTs, but sacrifices some of the performance and gains extra flexibility over the execution within a node. [3]

Improving on cache coherence

Cache coherence performance is improved by enabling a larger number of cache hits and reducing cache misses for the types of applications that tend to be executed on the Cray X1.

One method of improving coherence-related performance is vectorizing calculations to minimize latency. Most scientific applications involve floating-point operations on data with few dependences, so their performance can be improved by the 32-element vector processors within the SSPs. To achieve high performance for vectorized programs, it is important to keep the vector pipeline full [3].

To aid in this, the Cray Fortran and C compilers automatically vectorize loops as much as possible when they can determine that there are no data dependences and the calculations fit within an MSP. If the compiler produces suboptimal results because its static code analysis is too conservative, the application programmer can, after analyzing the algorithm, use compiler directives to force a different optimization (including preferring vectorization of one loop over another) to improve vectorization and maximize cache hits [3, CH 7]. More detail on these techniques can be found in [3, CH 7.4].
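
The kind of loop that benefits is one whose iterations are independent, as in the sketch below. The directive shown is only an illustration of asserting "no vector dependences" to the compiler; the exact Cray directive spelling should be checked against [3], and the function name is invented.

<pre>
/* A dependence-free loop of this shape is a candidate for automatic
 * vectorization.  When aliasing keeps the compiler from proving
 * independence, a directive asserting no vector dependences can be
 * supplied; the spelling below (ivdep) is illustrative only and should
 * be checked against the Cray compiler documentation [3]. */
void daxpy(double * restrict y, const double * restrict x,
           double a, int n)
{
#pragma ivdep
    for (int i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];   /* independent iterations: vectorizable */
}
</pre>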

Shasta distributed shared memory protocol

Shasta is a DSM protocol that allows for multiple levels of granularity for pages in lieu of standard page sizes. It is a software implementation that creates a shared address space across machines with distributed memory, and thus represents a pure software approach to solving DSM performance issues [5].

The motivation behind Shasta was to provide an alternative to SVM (shared virtual memory) systems, which use the virtual memory hardware to produce a shared address space at a fixed page granularity and maintain coherence across the interconnect. Shasta is designed to be flexible, enables cache coherence protocols that "improve parallel performance" to be implemented in software, and does all this with fewer coherence messages than common hardware DSM systems [5].

The cache coherence protocol is a directory-based invalidation protocol. Each directory entry contains a pointer to the owner processor and a full bit vector of sharer processors. The directories are decentralized and maintained at the different nodes (acting as home nodes). The shared address space is divided into blocks (of multiple sizes, as explored later), and each block is itself divided into fixed-size lines, where each line has an entry in a state table. Coherence is maintained at the block level, but miss checks are simplified because they only need to look at the state table entry for the relevant line [5].
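
A minimal sketch of the bookkeeping this implies is shown below. The field names, sizes, and the transient states (used later for non-blocking operations) are assumptions for illustration, not Shasta's actual data layout.

<pre>
#include <stdint.h>

#define MAX_PROCS 64                 /* assumed system size */

/* Per-line coherence state kept in the state table at each node. */
typedef enum {
    LINE_INVALID,
    LINE_SHARED,
    LINE_EXCLUSIVE,                  /* dirty / owned                        */
    LINE_PENDING_INVALID,            /* read or read-exclusive outstanding   */
    LINE_PENDING_SHARED              /* upgrade outstanding                  */
} line_state_t;

/* Directory entry kept at a block's home node. */
typedef struct {
    uint16_t owner;                  /* processor holding a dirty copy       */
    uint64_t sharers;                /* full bit vector, one bit per sharer  */
} dir_entry_t;

/* A block is a contiguous range of fixed-size lines; coherence is kept per
 * block, but a miss check only consults the state of one line. */
typedef struct {
    uint32_t      block_size;        /* varies by address range              */
    line_state_t *line_state;        /* one entry per line in the block      */
} block_meta_t;
</pre>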

Since Shasta is a software approach, the Shasta compiler inlines the instructions needed to adhere to the protocol, including the miss checks performed before loads and stores.
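
Conceptually, the inserted code looks like the sketch below: before each shared load or store, the generated check consults the line's state-table entry and calls into the protocol on a miss. The helper names are hypothetical, not Shasta's real API.

<pre>
/* Hypothetical helpers; the names are illustrative, not Shasta's real API. */
extern int  line_is_invalid(const void *addr);    /* state-table lookup          */
extern void shasta_read_miss(const void *addr);   /* protocol handler: fetch line */

/* What the compiler conceptually inlines before a shared-memory load. */
static inline double checked_load(const double *addr)
{
    if (line_is_invalid(addr))       /* miss check inserted before the load   */
        shasta_read_miss(addr);      /* bring the line in and update its state */
    return *addr;                    /* the actual application load            */
}
</pre>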

Improving on cache coherence and interconnect latency

Shasta improves cache coherence performance by reducing false sharing misses. To accomplish this, the protocol supports subdividing the shared address space into ranges, where different ranges use different block sizes. Dynamic memory allocations are made in the shared memory space, so the protocol can select the range whose block size best matches the allocation. Multiple levels of granularity enable this fine-grain sharing and reduce false sharing misses, thereby increasing coherence performance without the complexity of dealing with false sharing directly [5].
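
A simplified view of how an allocator might honor this is sketched below. The address ranges, block sizes, and selection rule are invented for illustration and are not Shasta's actual policy.

<pre>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: each range of the shared address space uses one fixed
 * block (coherence-unit) size; an allocation is served from the range whose
 * block size best matches the object being allocated. */
typedef struct {
    uintptr_t base;        /* start of the range in the shared space */
    size_t    length;      /* size of the range                      */
    size_t    block_size;  /* coherence granularity for this range   */
} shared_range_t;

static shared_range_t ranges[] = {
    { 0x10000000, 1u << 28,   64 },   /* small objects: fine granularity  */
    { 0x20000000, 1u << 28, 1024 },   /* medium objects                   */
    { 0x30000000, 1u << 28, 8192 },   /* large arrays: coarse granularity */
};

/* Pick the range with the largest block size not exceeding the request, so
 * unrelated small objects never end up sharing one large coherence unit. */
static const shared_range_t *pick_range(size_t nbytes)
{
    const shared_range_t *best = &ranges[0];
    for (size_t i = 1; i < sizeof ranges / sizeof ranges[0]; i++)
        if (ranges[i].block_size <= nbytes &&
            ranges[i].block_size > best->block_size)
            best = &ranges[i];
    return best;
}
</pre>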

Shasta also improves cache coherence performance by reducing the overhead of miss checks in several ways.

First, while dynamic memory allocations are shared, static allocations on the stack are not. Because such data lives on the stack, the miss check code can verify whether the base register of a load or store is, or is computed from, the stack pointer or global pointer; if so, the data is private and locally available, and the remainder of the miss check can be skipped. [5]

Second, the protocol stores a special invalid flag value into a line when it is invalidated. A miss check can then quickly compare the loaded value with the flag value: if they differ, the line is certainly valid and the rest of the check is skipped. If they are equal, the miss check goes on to consult the state table entry for the line to determine whether it is legitimately a miss or whether the line contents just happened to be valid but equal to the invalidation flag (a "false miss"). As the Shasta authors note, false misses almost never occur, because the invalidation flag is written into every long word of the line, a pattern that rarely arises in real data. These techniques lead to large reductions in miss check time [5].
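
The fast path can be sketched as follows; the sentinel value and helper names are assumptions for illustration, not Shasta's actual constants or API.

<pre>
#include <stdint.h>

#define SHASTA_INVALID_FLAG 0xDEADBEEFDEADBEEFULL   /* assumed sentinel value */

extern int  line_state_says_invalid(const void *addr);  /* full state-table check */
extern void shasta_read_miss(const void *addr);         /* protocol handler       */

/* Flag-based miss check: if the loaded word differs from the sentinel, the
 * line is certainly valid, so the common case needs no state-table access. */
static inline uint64_t checked_load_fast(const uint64_t *addr)
{
    uint64_t value = *addr;                      /* speculative load              */
    if (value == SHASTA_INVALID_FLAG) {          /* possibly invalid...           */
        if (line_state_says_invalid(addr)) {     /* ...confirm (else "false miss") */
            shasta_read_miss(addr);              /* fetch the line                */
            value = *addr;                       /* reload the real data          */
        }
    }
    return value;
}
</pre>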

Third, miss checks for multiple loads and stores are grouped, or batched. The Shasta authors provide an example (paraphrased here) of consecutive loads and stores that are all relative to one base memory location and that span a range no larger than a line; in this case, at most two consecutive lines are being accessed. If inline checks show that these lines are not invalid, the loads and stores can proceed without further checks, reducing the latency of the miss check. Special techniques exist for checking the two consecutive line states inline using the base register and offsets. More detail on the implementation can be found in [5].

Another improvement comes through the protocol requests themselves. The protocol is a directory-based invalidation protocol that supports dirty sharing. The three types of requests it supports are read, read-exclusive (write), and upgrade. The upgrade request, known as "exclusive" in the protocol, enables a processor to convert a shared block into a modified block without performing a read-exclusive, avoiding the data transfer and latency of a read-exclusive request. [5]
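
The distinction can be made concrete with a small sketch; the message and state names and the decision rule below are illustrative assumptions rather than Shasta's actual wire protocol.

<pre>
/* Illustrative request and state names; not Shasta's actual messages. */
typedef enum { REQ_READ, REQ_READ_EXCLUSIVE, REQ_UPGRADE } request_t;
typedef enum { ST_INVALID, ST_SHARED, ST_EXCLUSIVE } local_state_t;

/* Choose the cheapest request that makes a line writable: a processor that
 * already holds the data in the shared state only needs ownership, so it
 * issues an upgrade and avoids the data transfer of a read-exclusive. */
static request_t request_for_write(local_state_t s)
{
    switch (s) {
    case ST_SHARED:  return REQ_UPGRADE;          /* data present, need ownership only */
    case ST_INVALID: return REQ_READ_EXCLUSIVE;   /* need both data and ownership      */
    default:         return REQ_READ_EXCLUSIVE;   /* ST_EXCLUSIVE: no request is really
                                                     needed; shown for completeness    */
    }
}
</pre>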

The protocol requests are further improved by guaranteeing that a home node acts on every request it receives. A request can therefore be considered complete when it reaches the home node, rather than when acknowledgements are received from the other nodes, which reduces the latency of a coherence request [5].

Improving on memory consistency

Shasta improves memory consistency performance by providing mechanisms to support release consistency. Release consistency relaxes the requirements of sequential consistency: only synchronization accesses need to be ordered with respect to one another; loads and stores issued after an acquire may not begin until the acquire completes; and all loads and stores issued before a release must complete before the release executes, while accesses issued after the release do not need to wait for it to complete.
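
The same ordering contract can be illustrated with C11 acquire/release atomics, which express these constraints portably; this is a generic example rather than Shasta code, and the variable names are invented.

<pre>
#include <stdatomic.h>
#include <stdbool.h>

double shared_result;           /* ordinary (data) accesses      */
atomic_bool ready = false;      /* synchronization variable      */

void producer(void)
{
    shared_result = 42.0;                              /* ordinary store */
    /* Release: all earlier loads/stores complete before the flag is set. */
    atomic_store_explicit(&ready, true, memory_order_release);
    /* Accesses issued after the release need not wait for it to finish. */
}

void consumer(void)
{
    /* Acquire: accesses after it may not begin until the acquire completes. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                              /* spin until released */
    double r = shared_result;                          /* guaranteed to see 42.0 */
    (void)r;
}
</pre>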

The protocol supports non-blocking loads and stores by adding transient states for lines: pending-invalid, entered in response to an outstanding read or read-exclusive, and pending-shared, entered in response to an outstanding upgrade. Subsequent memory accesses to other lines can continue while a line is in a transient state, so latency is reduced for those accesses. [5]

The protocol also supports non-blocking releases, which delay a release but do not block subsequent accesses from occurring before the release is complete. This reduces the latency of statements immediately after (or exiting) a critical section. [5]

Shasta also supports what is known as lockup-free behavior for lines. For example, a write to a line that is in a pending state is allowed to complete by placing the new value into memory and recording the location with a miss handler, which is consulted on subsequent accesses to that line. This keeps subsequent writes from having to wait for the first write to complete, reducing their latency. Another lockup-free feature is that reads can be serviced immediately if a line is in the pending-shared state, since the line is assumed to already hold the latest value [5]. More detail can be found in [5].

Definitions

communication assist
The hardware (or firmware) at a node that connects the processors and memory to the interconnect and assists in initiating and handling network transactions on the node's behalf.
directory
A structure, associated with each block of shared memory, that records the coherence state of the block and which processors own or share cached copies, so that coherence messages need only be sent to the relevant nodes [1].
DSM
Distributed shared memory, a parallel computer architecture which consists of a set of nodes that maintain their own local memory, but all nodes are connected together, making their memories one shared addressable space.
Ecache
The cache of a Cray X1 MSP; each MSP contains four 0.5 MB Ecaches (2 MB total) shared by its SSPs [2].
granularity
The size of the unit of data over which coherence (or sharing) is maintained, such as a line, block, or page. Finer granularity reduces false sharing, while coarser granularity reduces bookkeeping overhead.
home node
The node whose local memory holds a given block of the shared address space and which maintains the directory information for that block.
MSP
Multi-streaming processor used in the Cray X1 architecture. MSPs consist of 4 SSPs and 4 x 0.5 MB Ecaches. A set of four MSPs is equivalent to a node and connects using an X1 routing module.
network transactions
A one-way transfer of information across the interconnect, from an output buffer at the source node to an input buffer at the destination node, that causes some action at the destination, such as depositing data or servicing a request.
node
A compute unit that makes up one component of a DSM system. A node consists of one or more sets of processors, cache, and memory, and is connected to the larger DSM system through an interconnect. [2]
RTTs
remote translation tables. RTTs are similar to TLBs in the Cray X1 architecture, but contain translations for non-local (off-node) references.
Shasta
A distributed shared memory protocol that allows for multiple levels of granularity in pages to reduce the occurrence of false sharing cache misses [5].
SMP
symmetric (shared-memory) multiprocessor. A parallel computer architecture in which all processors share a common bank of memory with uniform access time.
SSP
Single-streaming processor used in the Cray X1 architecture. An SSP consists of one 2-way superscalar unit and two 32-stage, 64-bit floating-point vector units. Four SSPs exist in each MSP. [2]
TLB
translation lookaside buffer. TLBs are cache-like data structures that contain the mapping from a virtual memory address to a physical memory address.
write propagation
The requirement that a write performed by one processor eventually be made visible to the other processors that access the same location [1].
write serialization
The requirement that writes to a single memory location be seen in the same order by all processors [1].

Suggested Reading

Directory based coherence protocols
Solihin [1, pp. 332-352]

References

1) Y. Solihin, Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems, Solihin Publishing & Consulting, LLC, 2009.

2) IEEE Micro, Cray X1 (ieee-micro-cray-x1-v8.pdf from Gehringer's wiki topics email)

3) Optimizing Applications on the Cray X1 system (http://docs.cray.com/books/S-2315-52/html-S-2315-52/S-2315-52-toc.html)

4) Wikipedia article on TLB - ( http://en.wikipedia.org/wiki/Translation_lookaside_buffer )

5) Performance of the Shasta Distributed Shared Memory Protocol (WRL-97-2.pdf from Gehringer's wiki topics email)

<references/>

Quiz

1) Which of the following are necessary for maintaining correct
cache coherence?
a) write invalidation & write update
b) write serialization & write propagation 
c) write serialization & write-through
d) write allocation & write buffer

2) Which of the following will NOT help the Cray X1 supercomputer
improve performance?
a) Reduce the impact of latency from interconnections
b) Vectorizing calculations to minimize latency.
c) Reducing the overhead of miss checks 
d) Enabling larger amounts of cache hits while reducing cache misses

3) For the Cray X1 supercomputer, which of the following has the same functionality as an RTT?
a) SMP
b) MSP
c) DSM
d) TLB 

4) Which of the following is NOT true?
a) Shasta represents a pure software approach to solving DSM performance issues.
b) The motivation behind Cray X1 was to create a SVM.
c) Shasta allows for multiple levels of granularity for pages in lieu of standard page sizes.
d) The SSPs of the Cray X1 are made up of 2 vector processors and one superscalar processor.

5) When improving the ''latency'' of a Cray X1 supercomputer, which of the following happens during the configuration where an application execution spans multiple nodes?
a) All address translations for remote nodes are handled through the TLBs emulating RTTs	
b) Allows the application programmer to use APIs such as OpenMP	
c) Utilize shared memory programming models within a node. 
d) Limits the resources to four MSPs

6) True or False: Shasta supports what is known as lockup-free behavior to lines.
a) True
b) False

7) True or False: Sequential consistency is a type of memory consistency that requires atomic memory accesses and enforces program order of memory accesses across all processors.
a) True
b) False

8) Distributed shared memory systems utilize interconnects between ____.
a) routers
b) processors
c) nodes
d) cache coherence controllers

9) A performance concern of maintaining correctness in cache coherence is ______.
a) write serialization
b) write-back
c) invalidation

10) True or False: Write propagation on DSM machines can always be treated as on bus based systems where a write is seen when an invalidation is sent to sharers and a read is complete when data is returned to the requestor.
a) True
b) False

Answer Key

1) b. 2) c. 3) d. 4) b. 5) a. 6) a. 7) b. 8) c. 9) a. 10) b.