CSC/ECE 506 Spring 2010/ch 6 PP

From Expertiza_Wiki
Revision as of 23:40, 24 February 2010 by Ppratha (talk | contribs)

Cache Structures of Multi-Core Architectures

Overview

With the advent of multicore and many-core architectures, we face a problem that is relatively new to parallel computing: the management of hierarchical parallel caches. This chapter describes some of the mainstream memory organizations in multiprocessor architectures. It also describes some of the protocols these caches use to handle replacement and writes across multiple caches.

Shared Memory Multiprocessors

Scalable shared-memory multiprocessors are emerging as attractive platforms for applications with high-performance demands. What makes these machines attractive is the shared address space, which allows processors in a multiprocessor to share data the same way it is shared by multiple processes in a sequential machine. The shared-memory paradigm makes it easier to write parallel programs, but tuning the application to reduce the impact of frequent long-latency memory accesses still requires substantial programmer effort.

From the perspective of system architecture, current mainstream shared memory multiprocessors fall into three categories, as shown in Figure 1:

UMA (Uniform Memory Access), NUMA (Non-Uniform Memory Access) and COMA (Cache-Only Memory Architectures).

Figure 1: Shared Memory Multiprocessors


UMA (Uniform Memory Access)

All the processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request and of which memory chip contains the requested data. Each processor may use a private cache. Peripherals are also shared in some fashion. The UMA model is suitable for general-purpose and time-sharing applications with multiple users. It can also be used to speed up the execution of a single large program in time-critical applications.


Figure 2: UMA-Uniform Memory Access

SMP (symmetric multiprocessing) is a common UMA architecture, in which multiple processors are connected symmetrically to the system memory and access it equally and uniformly. Because all processors in an SMP system share the bus, contention for it increases dramatically as the number of processors grows.

Because of this system-bus bottleneck, current SMP systems usually have only tens of processors and limited scalability. The architecture provides almost identical memory access latencies for every processor, but the common system bus is a potential bandwidth bottleneck for the entire memory system. If a multi-threaded application is bound by memory bandwidth, its performance will be limited by this memory organization.
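
The bandwidth argument can be made concrete with a rough calculation. The sketch below assumes an idealized bus whose total bandwidth is split evenly among processors that all issue memory traffic at once; the function name and the 16 GB/s figure are hypothetical and chosen only for illustration, not measurements of any real system.

```python
def per_processor_bandwidth(bus_bandwidth_gbs: float, num_processors: int) -> float:
    """Upper bound on the bandwidth each processor sees on a shared bus.

    Assumes (idealized) that every processor contends equally and that
    arbitration itself costs nothing.
    """
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    return bus_bandwidth_gbs / num_processors


# Hypothetical 16 GB/s bus shared by a growing number of processors.
shares = {n: per_processor_bandwidth(16.0, n) for n in (1, 2, 4, 8, 16, 32)}

if __name__ == "__main__":
    for n, share in shares.items():
        print(f"{n:2d} processors -> at most {share:.2f} GB/s each")
```

Even under these generous assumptions, the per-processor share shrinks in proportion to the processor count, which is why bus-based SMP designs stop scaling at a modest size.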


NUMA (Non-Uniform Memory Access)

NUMA is a specialized memory architecture for multiprocessor systems in which one set of CPUs sits on one memory bus, other sets of CPUs sit on different memory buses, and the resulting processing nodes are connected by a high-speed interconnect. This contrasts with SMP, where all memory accesses go through a single shared memory bus. The whole system is logically divided into multiple nodes, each of which can access both local and remote memory. Accessing local memory is faster than accessing remote memory, hence the name non-uniform memory access.

Consider Figure 3. Each group of processors has its own memory and possibly its own I/O channels, but each CPU can access memory associated with other groups in a coherent way. Each group is called a NUMA node. The number of CPUs within a NUMA node depends on the hardware vendor.



Figure 3: NUMA-Non Uniform Memory Access


Within a node, the processors share a common memory space, or “local” memory. For example, an SMP system’s shared memory would make up a node. This memory provides the fastest non-cache memory access for the node’s CPUs. Multiple nodes are then combined to form a NUMA machine. Memory transfers between nodes are handled by routers. Memory that can only be accessed via a router is called “remote” memory.

NUMA is easy to manage and expand, but accesses to remote memory are slow. The main benefit of NUMA is scalability. The NUMA architecture was designed to surpass the scalability limits of the SMP architecture. With SMP, all memory accesses are posted to the same shared memory bus. This works fine for a relatively small number of CPUs, but not when you have dozens, or even hundreds, of CPUs competing for access to the shared memory bus. NUMA alleviates these bottlenecks by limiting the number of CPUs on any one memory bus and connecting the various nodes by means of a high-speed interconnect.
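
The trade-off between fast local and slow remote accesses can be expressed as a weighted average. The sketch below is a simple model rather than a description of any particular machine: the function name and the latency values in the usage loop are hypothetical, and the model ignores caches and interconnect contention.

```python
def average_access_ns(local_fraction: float, local_ns: float, remote_ns: float) -> float:
    """Mean memory access time when a given fraction of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    # Hypothetical latencies: 100 ns for local memory, 300 ns for remote memory.
    for fraction in (1.0, 0.9, 0.5, 0.0):
        mean = average_access_ns(fraction, 100.0, 300.0)
        print(f"{fraction:.0%} local accesses -> {mean:.0f} ns on average")
```

The model shows why data placement matters on NUMA: the closer the local fraction is to 1, the closer the machine behaves to a small, fast SMP node.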

Cache-Only Memory Architectures (COMA)

In a cache-only memory architecture, memory organization is similar to that of NUMA in that each processor holds a portion of the address space. However, the partitioning of data among the memories does not have to be static, since all distributed memories are organized like large (second-level) caches. The task of such a memory is twofold. Besides being a large cache for the processor, it may also contain some data from the shared address space that the processor has never accessed before. In other words, it is both a cache and a virtual part of the shared memory. Such a memory is called an attraction memory.

COMA increases the chances of data being available locally because the hardware transparently replicates the data and migrates it to the memory module of the node that is currently accessing it. Each memory module acts as a huge cache memory in which each block has a tag with the address and the state.

Figure 4: COMA- Cache-Only Memory Architectures

When a processor requests a block from a remote memory, the block is inserted in both the processor’s cache and the node’s attraction memory (AM). A block can be evicted from an AM if another block needs the space. Ideally, with this support, the processor dynamically attracts its working set into its local memory module. Data that the processor is not accessing overflows and is sent to other memories. Because a large AM is more capable of containing the node’s current working set than a cache is, more of the cache misses are satisfied locally within the node.
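
The attraction and overflow behavior described above can be sketched with a toy model. This is a deliberately simplified illustration, not a real COMA protocol: it assumes blocks migrate rather than replicate (each block lives in exactly one attraction memory), uses least-recently-used eviction, and omits the processor caches entirely. The class name `ComaSystem` and its methods are hypothetical.

```python
from collections import OrderedDict


class ComaSystem:
    """Toy model: one fixed-size attraction memory (AM) per node."""

    def __init__(self, num_nodes: int, am_capacity: int):
        # Each AM maps block -> True, ordered from least to most recently used.
        self.ams = [OrderedDict() for _ in range(num_nodes)]
        self.capacity = am_capacity

    def place(self, node: int, block: str) -> None:
        """Initial placement of a block in a node's AM."""
        self._insert(node, block)

    def access(self, node: int, block: str) -> str:
        """Access a block from a node; returns 'local' or 'remote'."""
        am = self.ams[node]
        if block in am:
            am.move_to_end(block)          # refresh LRU position
            return "local"
        for other in self.ams:             # locate the current holder
            if block in other:
                del other[block]           # migrate: remove from the old AM
                break
        else:
            raise KeyError(block)
        self._insert(node, block)          # attract the block to this node
        return "remote"

    def _insert(self, node: int, block: str) -> None:
        am = self.ams[node]
        if len(am) >= self.capacity:
            victim, _ = am.popitem(last=False)   # evict least recently used
            self._overflow(node, victim)
        am[block] = True

    def _overflow(self, from_node: int, block: str) -> None:
        """Send an evicted block to some other AM that has a free slot."""
        for i, am in enumerate(self.ams):
            if i != from_node and len(am) < self.capacity:
                am[block] = True
                return
        raise RuntimeError("no free attraction-memory slot for evicted block")


if __name__ == "__main__":
    system = ComaSystem(num_nodes=2, am_capacity=2)
    for node, block in [(0, "A"), (0, "B"), (1, "C"), (1, "D")]:
        system.place(node, block)
    print(system.access(0, "C"))   # first touch from node 0
    print(system.access(0, "C"))   # repeated touch from node 0
    print([list(am) for am in system.ams])
```

In this model a remote access always frees a slot in the old holder before the evicted victim needs a home, so the overflow step finds space; a real COMA design must handle replication and the last-copy problem explicitly.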

Cache Coherence

In a multiprocessing system, suppose a task running on processor P requests the data in global memory location X. The contents of X are copied to processor P’s local cache. Now suppose processor Q also accesses X. What happens if Q wants to write a new value over the old value of X? Without coordination, P’s cached copy would become stale.

There are two fundamental cache coherence policies:

  • Write-invalidate - On a write to a block by a cache, all other cached copies are made invalid. An invalidated copy is updated only when it is next accessed, by bringing in the new block.
  • Write-update - On a write to a block by any cache, the updated value is propagated to all other caches that hold the block.
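
The difference between the two policies can be seen in a small simulation. This is a minimal sketch under stated assumptions, not a full coherence protocol: memory is written through on every store, each cache is a plain dictionary in which a missing entry means "invalid", and there are no per-block states. The class name `CoherentCaches` and its methods are hypothetical.

```python
class CoherentCaches:
    """Toy model of several private caches in front of one shared memory."""

    def __init__(self, num_caches: int, policy: str):
        if policy not in ("invalidate", "update"):
            raise ValueError("policy must be 'invalidate' or 'update'")
        self.policy = policy
        self.memory = {}                                   # address -> value
        self.caches = [dict() for _ in range(num_caches)]  # absent entry = invalid

    def read(self, cache_id: int, addr: str) -> int:
        cache = self.caches[cache_id]
        if addr not in cache:
            # Miss (or invalidated copy): fetch the current value from memory.
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cache_id: int, addr: str, value: int) -> None:
        self.memory[addr] = value          # assumption: write-through memory
        self.caches[cache_id][addr] = value
        for i, other in enumerate(self.caches):
            if i == cache_id or addr not in other:
                continue
            if self.policy == "invalidate":
                del other[addr]            # other copies become invalid
            else:
                other[addr] = value        # other copies receive the new value


if __name__ == "__main__":
    for policy in ("invalidate", "update"):
        system = CoherentCaches(num_caches=2, policy=policy)
        system.read(0, "X")                # P caches X
        system.read(1, "X")                # Q caches X
        system.write(1, "X", 42)           # Q overwrites X
        print(policy, "-> P's cache after Q's write:", system.caches[0])
        print(policy, "-> P reads X:", system.read(0, "X"))
```

Under either policy P eventually reads the new value; the policies differ in when P's copy changes: on its next access with write-invalidate, and at the time of Q's write with write-update.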