CSC/ECE 506 Spring 2012/11a ht
Introduction
When dealing with a relatively small number of processors (8-16), according to Solihin 320, using a bus based shared memory structure is fine. Unfortunately, when you need to provide a shared memory structure for processors much greater than that, you will need a different set of organization. This new organization is needed due to the physical limitations of the bus. There are two ways you can create such a system. These include Distributed Shared Memory (DSM) or Non-Uniform Memory Access (NUMA). The benefits of having a DSM and NUMA is that we can now scale to a larger amount of processors. The disadvantage is that scaling in such a way may not be the most cost-effective solution, Solihin 320. For the remainder of this section, we will be discussing the performance of DSM's.
According to Solihin 320, there are two aspects that restrict the scalability of bus-based multiprocessors. These include the physical limitations of interconnections and the limitations of the protocol. To explain in detail, on a bus-based system, adding a processor will not affect any other physical restrictions on the system. Unfortunately, when adding a new processor, you will be reducing the speed of the bus. Second, the protocol needed to keep coherence does not scale well. As you increase the number of processors to the system, the amount of traffic also increases. This means that you might run the risk of overwhelming the bandwith. According to Solihin, there are a few ways that we can mitigate this problem. The following is from 321 of the Solihin textbook.
From the table, we can see that there is three ways to scale a multiprocessor system. The first being a single bus system. This is the least scalable due to the limitations of the bus wire itself. As you add processors you will decrease the bus speed due to having to increase the wire length. Also, you run into an issue of overwhelming the bus due to the amount of traffic. The second way is to use a point-to-point bus system. This allows for the speed of the bus to remain relatively fast, but since the traffic will also scale with the number of processors, there will be a limitation due to overwhelming the bus system with traffic. Lastly, the most scalable system to date is using a directory system. This allows for the bus to remain fast due to the short wires, and the bus traffic to remain low since the directory holds information on cache locations.
Optimizations
Shasta Protocol
The Shasta protocol improves on the regular software disributed shared memory system such that the shared data can be kept coherent at fine granularity. This is implemented by inserting inline code that checkes the cache state of shared data before each load or store. This protocol also allows the coherence granularity to be varied across different shared data structures in a single application, thus alleviating any potential inefficiencies that arise from the fixed large(page-size) granularity of the communication typical in most software shared memory systems. As in the hardware cache-coherent multiprocessors, shared data in this systems has three states - Invalid, Shared and Exclusive. This protocol provides a number of mechanisms for dealing with the long communication latencies in a workstation cluster. It minimizes extraneous coherence messages, and hence requires fewer messages to satisfy shared memory operations compared to protocols commonly used in hardware DSM systems. It also includes optimizations such as non-blocking stores, that aggressively exploit a relaxed memory consistency model. Other optimizations include detection of migratory data sharing, issuing multiple load misses simultaneously, merging of load and store misses to the same cache line, and support for prefetching and home placement directives.
Multiple Coherence Granularity
The Shasta protocol can support multiple granularities for communication and coherence, even with a single application. This can provide a significant performance boost in a software DSM system, since data with good spatial locality can be communicated at a coarse grain to amortize large communication overheads , while data prone to false sharing can use a finer sharing granularity. The implementation automatically chooses a block size based on the allocated size of a data structure. The basic heuristic is to choose a block size equal to the object size up to a certain threshold; the block size for objects larger than a given threshold is simply set to the base Shasta line size (typically set to be 64 bytes). The rationale for the heuristic is that small objects should be transferred as a single unit, while larger objects (e.g. large arrays) should be communicated at a fine granularity to avoid false sharing. Since the choice of the block size does not affect the correctness of the program, the programmer can freely experiment with various block sizes to tune the performance of an application. Controlling the coherence granularity in this manner is significantly simpler than approaches adopted by object-based or region-based DSM systems , since the latter approaches can affect correctness and typically require a more substantial change to the application.
Minimization of Protocol Messages
The Shasta protocol is designed to minimize extraneous coherence messages, given the relatively high overheads associated with handling messages in software DSM implementations. In this protocol, the current owner node specified by the directory guarantees to service a request that is forwarded to it. The fact that the current owner guarantees to service a request that is forwarded to it allows the protocol to complete all directory state changes when a requestor first reaches the home. This property eliminates the need for extra messages that are sent back to the home to confirm that the forwareded request is satisfied. Since this protocol supports dirty sharing, it also eliminates the need for sending an up-to-date copy of the line back to the home in case of read transaction when the home node is remote and the data is dirty in another node. Supporting exclusive requests reduces the need for fetching data on a store if the requesting processor already has the line in shared state. Also, the number of invalidation acknowledgements that are expected for an exclusive request are piggybacked on on of the invalidation acknowledgements to the requestor instead of being sent as a separate message.
Exploitation of Relaxed Memory Models
According to <ref name="nitzberg">Nitzberg, B.; Lo, V.; , "Distributed shared memory: a survey of issues and algorithms," Computer , vol.24, no.8, pp.52-60, Aug. 1991 doi: 10.1109/2.84877 paper</ref> DSM has been researched since the 1980's. There are many reasons why DSM has been an area of research focus. Uniprocessor bus-based systems suffer from a hardware and software limitation that can be mitigated using a DSM. But, this has only become an issue as we start having more and faster processors that we want to add to the system. According to Nitzberg<ref name"nitzberg"></ref>, there are three approaches that have been used to implement a DSM system.
References
<references />