ECE506 CSC/ECE 506 Spring 2012/11a az: Difference between revisions

From Expertiza_Wiki
Revision as of 23:26, 15 April 2012

11a.  Performance of DSM systems.  Distributed shared memory systems combine the programming model of shared memory systems with the scalability of distributed systems.  However, because DSM systems need extra coordination between the software layer and the underlying hardware, achieving good performance can be a major challenge. Factors that can harm performance include the overhead of maintaining cache coherence and memory consistency, and the latency of interconnections. Please further explore the factors that can affect the performance of DSM systems, and the improvements that have been made on existing systems.

Introduction

Under construction

Cache coherence

DSM systems must maintain cache coherence just as bus-based multiprocessor systems must. Cache coherence problems arise when it is undefined how a change to a value in one processor's cache is propagated to the other caches [1, p. 183]. If multiple processors access and modify a shared memory location and produce outputs based on that shared variable, they can compute incorrect values if cache coherence is not maintained.

Ensuring that a value changed in one cache is sent to the other caches is called write propagation [1, p. 183]. Write propagation is one of the requirements that must be met to provide cache coherence. Without write propagation, one processor could modify a cached value and not notify the other processors that have the same value cached. Those caches would still believe they hold the latest data, so on subsequent reads they would supply the stale value to their respective processors, leading to incoherent results.
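The stale read described above can be reproduced with a toy model. The following is a minimal sketch, assuming two private write-through caches in front of one shared memory and propagation by invalidation; the class `Cache`, the helper `run`, and the address `0x10` are hypothetical names chosen for illustration and do not describe any particular hardware.

```python
# Toy model: two private caches in front of one shared memory.
# All names here are hypothetical and exist only for illustration.

class Cache:
    def __init__(self, memory):
        self.memory = memory      # shared backing store (address -> value)
        self.lines = {}           # privately cached copies (address -> value)

    def read(self, addr):
        # A hit returns the cached copy, even if that copy is stale.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value, peers=()):
        # Write through to memory; invalidate peer copies only if peers are given.
        self.lines[addr] = value
        self.memory[addr] = value
        for peer in peers:
            peer.lines.pop(addr, None)   # write propagation by invalidation


def run(propagate):
    memory = {0x10: 1}
    c0, c1 = Cache(memory), Cache(memory)
    c1.read(0x10)                                   # c1 caches the old value
    c0.write(0x10, 2, peers=[c1] if propagate else [])
    return c1.read(0x10)                            # what c1 observes afterwards


stale = run(propagate=False)   # c1 keeps serving its old copy
fresh = run(propagate=True)    # c1 misses and refetches the new value
```

Without propagation the second cache keeps returning the value it cached before the write; with invalidation it is forced to miss and fetch the new value.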

FIXME: EXAMPLE?

Another requirement for cache coherence is write serialization, which Solihin [1, p. 183] defines as a requirement that "multiple changes to a single memory location are seen in the same order by all processors". If two processors perform writes to a single memory location in a certain order, then all other processors in the system should see the writes (by reading that memory location and subsequently caching the values) in the order in which they were written. If other processors observe the writes by reading the variable but see them in different orders, the caches can end up holding incoherent copies of the same variable while each cache believes it has the latest copy.
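A small sketch can make the ordering problem concrete. It assumes two writers (labelled P0 and P1) updating one location and two observers (P2 and P3) whose update messages arrive in different orders; the arrival orders and the single ordering point are illustrative assumptions, not a description of a specific protocol.

```python
# Two writers update the same location; values are keyed by writer name.
writes = {"P0": 1, "P1": 2}

# Without serialization, each observer applies updates in its own arrival order.
arrival = {"P2": ["P0", "P1"], "P3": ["P1", "P0"]}
unserialized = {obs: [writes[w] for w in order] for obs, order in arrival.items()}

# With serialization, a single ordering point (assumed here) picks one order
# and every observer applies the updates in that order.
global_order = ["P0", "P1"]
serialized = {obs: [writes[w] for w in global_order] for obs in arrival}

# The last value applied is what each observer ends up caching.
final_unserialized = {obs: seq[-1] for obs, seq in unserialized.items()}
final_serialized = {obs: seq[-1] for obs, seq in serialized.items()}
```

In the unserialized case the two observers finish with different values for the same location, which is exactly the incoherent outcome described above; in the serialized case they agree.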

FIXME: EXAMPLE?

Thus, for correctness purposes, it is required that write propagation and write serialization are provided by the cache coherence implementation.

To maintain cache coherence, a cache coherence protocol is implemented in hardware (or, in specific cases, in software). In DSM systems, the cache coherence controller in a node interfaces with the processor and its cache, but also has a communication link to the other nodes through an interconnect network via a communication assist. It receives and acts upon requests from the local processor, and it sends, receives, and acts on requests exchanged with other nodes as network transactions through the communication assist.

Unlike in bus-based multiprocessor systems, the coherence controllers in a DSM system are not connected by a shared medium that serializes communication, nor by bus signal lines such as the SHARED line (which is asserted in a bus-based system when another processor has a copy of the cache block being addressed). In bus-based systems, the bus is also the medium over which invalidations or updates are sent to other coherence controllers, depending on the coherence protocol, and it allows each controller to snoop requests from the others, such as reads, read-exclusives, and flushes. Because a DSM system has no bus, invalidations and updates are sent as network transactions instead. Additionally, without a bus there is no guarantee that a request has been seen by other processors once it is sent, so acknowledgement messages are also sent as network transactions in response to requests.

DSM-based systems do not replicate the broadcasting of messages to other coherence controllers as bus-based systems do, because the bandwidth requirements would be prohibitively large. Instead, many DSM systems use a construct called a directory, which records which nodes cache each block and in which state. This avoids having to broadcast the invalidations, updates, upgrades, interventions, flushes, and other messages that coherence controllers send on a bus. The directory lets a node select only the relevant subset of nodes as message recipients, thereby reducing network traffic.
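The traffic saving can be illustrated with a short sketch. It assumes a directory that keeps a set of sharer node IDs per block; the class `Directory`, its method names, the node count of 64, and the block address are all hypothetical.

```python
class Directory:
    """Hypothetical directory that tracks which nodes cache each block."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.sharers = {}                     # block address -> set of node ids

    def record_read(self, block, node):
        self.sharers.setdefault(block, set()).add(node)

    def invalidation_targets(self, block, writer):
        # Only nodes that actually hold the block need an invalidation.
        return sorted(self.sharers.get(block, set()) - {writer})

    def broadcast_targets(self, writer):
        # What a bus-like broadcast would have to reach instead.
        return [n for n in range(self.num_nodes) if n != writer]


directory = Directory(num_nodes=64)
for node in (3, 17, 42):
    directory.record_read(block=0x80, node=node)

targeted = directory.invalidation_targets(block=0x80, writer=3)
broadcast = directory.broadcast_targets(writer=3)
```

With three sharers in a 64-node system, a write by node 3 needs two invalidation messages under the directory scheme, whereas a broadcast would have to reach all 63 other nodes.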

To ensure that a directory is properly updated and reflects the true state of the caches, the directory has its own coherence protocol. It responds to read, read exclusive (write), and upgrade requests from the different nodes, and it sends invalidations to sharer nodes, replies with and without data, and interventions. Solihin [1, pp. 332-352] covers directory based coherence protocols in further detail.
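The request and message types listed above can be tied together in a small state machine. The sketch below assumes a simplified three-state directory entry (U for uncached, S for shared, EM for exclusive or modified) and MSI-style caches; the state names, message labels, and transitions are illustrative assumptions and are not a transcription of the protocol in [1]. Races between overlapping requests are deliberately ignored.

```python
class DirEntry:
    """Hypothetical directory entry for one memory block."""

    def __init__(self):
        self.state = "U"          # U = uncached, S = shared, EM = exclusive/modified
        self.sharers = set()


def handle(entry, request, node):
    """Update one directory entry; return the messages the home node sends."""
    others = sorted(entry.sharers - {node})
    if request == "Read":
        if entry.state == "EM":
            # The current owner must supply the data and downgrade.
            msgs = [("Int", owner) for owner in others]
        else:
            msgs = [("ReplyD", node)]            # reply with data from memory
        entry.sharers.add(node)
        entry.state = "S"
    elif request in ("ReadX", "Upgr"):
        msgs = [("Inv", other) for other in others]
        if request == "Upgr":
            msgs.append(("Reply", node))         # requester already has the data
        elif entry.state != "EM":
            msgs.append(("ReplyD", node))        # otherwise the old owner supplies it
        entry.sharers = {node}
        entry.state = "EM"
    else:
        raise ValueError(f"unknown request: {request}")
    return msgs


entry = DirEntry()
m1 = handle(entry, "Read", 1)     # first reader
m2 = handle(entry, "Read", 2)     # second reader
m3 = handle(entry, "Upgr", 1)     # node 1 wants to write
m4 = handle(entry, "Read", 2)     # node 2 reads the modified block
```

The trace shows a read reply with data, an upgrade that invalidates the other sharer and replies without data, and an intervention sent to the owner when another node reads a modified block.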

Each memory block address maps to a specific node (this mapping is generally determined at boot time), so a new term, home node, is introduced to signify the node that houses a specific block of memory. One naive implementation of directories is a centralized directory that exists at one node, but this suffers from performance problems because that node becomes a bottleneck for all transactions. A logical improvement is to use decentralized directories, where each home node maintains a directory for its own blocks. Because the mapping for a block is fixed, any node knows immediately which node is the home node and only needs to send its requests directly to that node. Solihin [1, pp. 325-327] covers home nodes and directory placement in further detail.
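One way such a fixed mapping could look is block interleaving, sketched below. The source only states that the mapping is fixed (generally at boot time); the interleaving scheme, the block size of 64 bytes, the node count of 8, and the function names are assumptions made for illustration.

```python
BLOCK_SIZE = 64     # bytes per block (assumed value)
NUM_NODES = 8       # number of nodes in the system (assumed value)

# Decentralized directories: one per node, each covering only its own blocks.
directories = [dict() for _ in range(NUM_NODES)]


def home_node(addr):
    """Block-interleaved mapping: consecutive blocks go to consecutive nodes."""
    return (addr // BLOCK_SIZE) % NUM_NODES


def directory_for(addr):
    """Any node can compute locally which directory is responsible for addr."""
    return directories[home_node(addr)]


homes = [home_node(a) for a in (0, 64, 70, 64 * NUM_NODES)]
```

Because the function depends only on the address and on constants known to every node, no lookup message is needed to find the home node, and directory load is spread across all nodes rather than concentrated at one.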

Memory consistency

Under construction

Interconnections

Under construction

Performance Concerns

Under construction

Maintaining cache coherence

To maintain correct cache coherence, write propagation and write serialization must be provided, both of which can have adverse effects on performance.

Write serialization requires that all writes to a memory location be seen in the same order by other processors. Earlier, an example was given indicating how write serialization can be violated by observing writes out of order. A naive implementation of write serialization would require that a request and all its messages are performed atomically to avoid overlapping of requests [1, p. 338]. Solihin [1, pp. 342-344] discusses correctness issues that can occur if requests are allowed to overlap without special consideration. A non-overlapping approach requires that each request has defined conditions indicating when it begins and when it ends, so that the home node can observe and wait for its completion before processing other requests to the same block.

The performance concern with disallowing overlapping requests is that subsequent read or write operations to the same block cannot start until the current request completes, even if some of the messages within those requests could be overlapped without correctness concerns.
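The non-overlapping approach and its cost can be sketched as a home node that marks a block busy while a request is in flight. The class `HomeNode`, its methods, and the request strings are hypothetical; queuing the waiting requests is one possible policy, assumed here for simplicity.

```python
from collections import deque


class HomeNode:
    """Hypothetical home node that processes one request per block at a time."""

    def __init__(self):
        self.busy = set()        # blocks with a request currently in flight
        self.pending = {}        # block -> deque of requests that must wait
        self.log = []            # order in which requests were actually started

    def request(self, block, req):
        if block in self.busy:
            self.pending.setdefault(block, deque()).append(req)
            return False         # delayed: an earlier request has not completed
        self._start(block, req)
        return True

    def complete(self, block):
        # Called once the final message of the in-flight request has been seen.
        self.busy.discard(block)
        queue = self.pending.get(block)
        if queue:
            self._start(block, queue.popleft())

    def _start(self, block, req):
        self.busy.add(block)
        self.log.append((block, req))


home = HomeNode()
started_a = home.request(0x40, "ReadX by P1")
started_b = home.request(0x40, "Read by P2")    # same block: must wait
started_c = home.request(0x80, "Read by P3")    # different block: proceeds
home.complete(0x40)                             # now the queued read starts
```

The second request to block 0x40 is held back until the first completes, which is the delay described above, while a request to a different block is unaffected.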

Maintaining memory consistency

Under construction

Latency of interconnections

Under construction

Performance Improvements

Under construction

Cray X1

Under construction

A Cray X1 node consists of four MSPs (multi-streaming processors) and a local shared memory of either 16 or 32 GB. Each MSP consists of four SSPs (single-streaming processors) and four Ecaches totaling 2 MB [3, CH 7]. Each SSP is made up of two vector processors and one superscalar processor. The nodes are connected together with X1 routing modules in a modified 2D torus configuration for larger configurations, or a 4-D hypercube for smaller configurations of up to 128 nodes [2].

The X1 is interesting in how it approaches performance problems because it is a hybrid SMP/DSM system. Within a node, it is possible to gain the performance of an SMP system without suffering the non-uniform memory access times of remote accesses [2].

For local memory accesses within a node, the virtual to physical mappings of memory pages are stored in a TLB (translation lookaside buffer). As in the page table schemes of most other architectures, TLBs are needed because processes are given virtual address spaces that may span many physical memory locations (or pages). When an access is made to a virtual address, the TLB provides a quick way to determine exactly where the physical memory resides, rather than resorting to walking a page table to find the translation [3 - CH 7, 4]. TLBs act like a cache for virtual to physical address translations, reducing the latency of accessing memory on SSP Ecache misses.
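The caching role of a TLB can be shown with a minimal sketch. It assumes a page size of 4096 bytes, a dictionary standing in for the page table, and unlimited TLB capacity; none of these values are taken from the X1 documentation, and the class `TLB` is a hypothetical name.

```python
PAGE_SIZE = 4096    # bytes per page (assumed value, not an X1 parameter)


class TLB:
    """Hypothetical TLB that caches page table translations."""

    def __init__(self, page_table):
        self.page_table = page_table    # virtual page number -> physical frame number
        self.entries = {}               # cached translations (capacity not modeled)
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            # The dictionary lookup stands in for a slow page table walk.
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset


tlb = TLB(page_table={0: 7, 1: 3})
first = tlb.translate(0x0010)     # miss: page 0 is not cached yet
second = tlb.translate(0x0020)    # hit: same page as the first access
third = tlb.translate(0x1004)     # miss: first access to page 1
```

Only the first access to each page pays for the page table walk; later accesses to the same page reuse the cached translation.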

The X1 also provides RTTs (remote translation tables), which provide the same functionality as a TLB but for non-local (off-node) memory references.

Improving on interconnect latency

The first performance improvement this architecture offers is a reduction in the impact of interconnect latency. In a traditional DSM system, remote memory accesses are passed over an interconnect and suffer the asymmetric latency of NUMA. In the X1, programs can configure and dictate how nodes are used [3, CH 1].
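The effect of remote accesses on latency can be estimated with a simple weighted average. The model below is an assumption introduced for illustration, and the latencies of 100 and 400 are placeholder numbers, not measurements of the X1 or any other system.

```python
def average_access_latency(local_ns, remote_ns, remote_fraction):
    """Weighted average latency for a mix of local and remote memory accesses."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Placeholder latencies chosen only to show the shape of the trade-off.
all_local = average_access_latency(100, 400, 0.0)
quarter_remote = average_access_latency(100, 400, 0.25)
```

Under this model, keeping every access local gives the local latency exactly, and each increase in the remote fraction moves the average toward the remote latency, which is why a configuration that avoids remote accesses can be attractive.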

In one configuration, an application executes within a single node. It then runs as if on an SMP system: all memory accesses are local and therefore have uniform latency, avoiding the latency of the interconnections to other nodes [3, CH 1]. This configuration allows the application programmer to use APIs such as OpenMP and shared memory programming models. Because this configuration limits the resources to four MSPs, a program must either achieve its desired performance on four MSPs or, if it requires more than one node, keep the memory accesses of different nodes from overlapping, which means segregating the data and possibly reducing (or keeping independent) the per-node results. Most scientific applications exhibit little data sharing, but those that do require data sharing see a performance improvement from running within a node and sharing memory at the local node level through a simple node configuration.

In another configuration, an application executes on multiple nodes.

Maintaining cache coherence

Under construction

Maintaining memory consistency

Under construction

Relaxed memory models with fine granularity coherence

Under construction

Latency of interconnections

Under construction

Definitions

Under construction

communication assist
FIXME
directory
FIXME
DSM
Distributed shared memory, a parallel computer architecture which consists of a set of nodes that maintain their own local memory, but all nodes are connected together, making their memories one shared addressable space.
Ecache
FIXME
granularity
FIXME
home node
FIXME
MSP
Multi-streaming processor used in the Cray X1 architecture. MSPs consist of 4 SSPs and 4 x 0.5 MB Ecaches. A set of four MSPs is equivalent to a node and connects using an X1 routing module.
network transactions
FIXME
node
A compute unit that makes up one component of a DSM system. A node consists of one or more sets of processors, cache, and memory. A node is connected to the larger DSM system through an interconnect. [2]
SSP
Single-streaming processor used in the Cray X1 architecture. An SSP consists of one 2-way superscalar unit and two 32-stage, 64-bit floating-point vector units. Four SSPs exist in each MSP. [2]
write propagation
FIXME
write serialization
FIXME

Suggested Reading

Under construction

Directory based coherence protocols
Solihin [1, p. 332-352] FIXME

References

Under construction

<references/>

Quiz

Under construction