CSC/ECE 506 Spring 2010/ch 6 PP: Difference between revisions
Line 85: | Line 85: | ||
'''First-In First-Out (FIFO):''' FIFO can be seen as a queue: in this policy new elements are inserted at the front evicting elements at the end of the queue. Disadvantage of this policy is that even if a block is accessed more number of times than another that was brought in later, it will be evicted. | '''First-In First-Out (FIFO):''' FIFO can be seen as a queue: in this policy new elements are inserted at the front evicting elements at the end of the queue. Disadvantage of this policy is that even if a block is accessed more number of times than another that was brought in later, it will be evicted. | ||
'''Pseudo-LRU (PLRU):''' For caches with large associativity (generally >4 ways), the implementation cost of LRU becomes prohibitive. If a scheme that almost always discards one of the least recently used items is sufficient, the PLRU algorithm can be used which only needs one bit per cache item to work. | '''Pseudo-LRU (PLRU):''' For caches with large associativity (generally >4 ways), the implementation cost of LRU becomes prohibitive. If a scheme that almost always discards one of the least recently used items is sufficient, the PLRU algorithm can be used which only needs one bit per cache item to work. | ||
Line 100: | Line 94: | ||
four-way set associative - three bits | four-way set associative - three bits | ||
Each bit represents one branch point in a binary decision tree; let 1 represent that the left side has been referenced more recently than theright side, and 0 vice-versa. | Each bit represents one branch point in a binary decision tree; let 1 represent that the left side has been referenced more recently than theright side, and 0 vice-versa.When a line must be replaced, the cache selects | ||
which of L0:L1 and L2:L3 was least recently used. Then the cache determines which of the two lines was least recently used and marks it for replacement. This decision tree is shown in Figure . | |||
[[Image:plru.jpg]] | |||
'''Most-Recently Used (MRU):''' Evicting the most-recently used (MRU) page does very well on LRU’s worst case. In general, however, MRU is a bad idea, since many programs exhibit temporal locality in their memory accesses, and MRU is effectively assuming that memory accesses will not exhibit locality. | |||
'''Least Frequently Used (LFU):''' LFU counts how often an item is needed. Those that are used least often are discarded first. | |||
'''Random:''' To spread allocation uniformly, blocks to be evicted are randomly selected. Some systems generate pseudorandom block numbers to get reproducible behavior, which is useful when debugging hardware | |||
==Cache Coherence== | ==Cache Coherence== |
Revision as of 19:57, 26 February 2010
Cache Structures of Multi-Core Architectures
Overview
With the advent of multicore and many core architectures, we are facing a problem that is relatively new to parallel computing, namely, the management of hierarchical parallel caches. This chapter describes some of the mainstream memory organizations in multiprocessor architectures. It also describes some of write and placement policies.
Scalable shared-memory multiprocessors are emerging as attractive platforms for applications with high-performance demands. What makes these machines attractive is the shared address space, which allows processors in a multiprocessor to share data the same way it is shared by multiple processes in a sequential machine. The shared-memory paradigm makes it easier to write parallel programs, but tuning the application to reduce the impact of frequent long-latency memory accesses still requires substantial programmer effort.
From the perspective of system architecture, current mainstream shared memory multiprocessors fall into three categories as shown in Figure 1 :
UMA(Unified Memory Access), NUMA (Non-Uniform Memory Access) and COMA (Cache-only Memory Architectures).
Figure 1: Shared Memory Multiprocessors
UMA-Uniform Memory Access
All the processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data. Each processor may use a private cache. Peripherals are also shared in some fashion; The UMA model is suitable for general purpose and time sharing applications by multiple users. It can be used to speed up the execution of a single large program in time critical applications.
Figure 2: UMA-Uniform Memory Access
SMP is a common UMA architecture, in which multiple processors are connected on the system memory symmetrically and access the system memory equally and uniformly. Since all processors in the SMP system share the bus and competition conflict upgrades dramatically when the number of processors increases.
Due to the performance bottleneck of the system bus, current SMP system usually has only tens of processors with limited scalability. This architecture provides almost identical memory access latencies for any processor. But on the other hand, a common system bus is a potential bottleneck of the entire memory system in terms of bandwidth. Indeed, if a multi-threaded application is critical to memory bandwidth, its performance will be limited by this memory organization.
NUMA (Non-Uniform Memory Access)
NUMA is specialized memory architecture for multiprocessor based systems where a set of CPUs on one system memory bus is fixed and other sets of CPUs are on different memory buses and the various processing nodes are connected by means of a high speed connection. This architecture is in contrast SMP where all memory access is shared through a single memory bus.The whole system logically divides into multiple nodes, which can access both local and remote memory resources. It is faster to access local memory than access remote memory. This is the reason for the name, non-uniform memory access architecture.
Consider Figure 3. Each group of processors has its own memory and possibly its own I/O channels, but each CPU can access memory associated with other groups in a coherent way. Each group is called a NUMA node. The number of CPUs within a NUMA node depends on the hardware vendor.
Figure 3: NUMA-Non Uniform Memory Access
In a node, the processors share a common memory space or “local” memory. For example, an SMP system’s shared memory would make up a node. This memory, of course, provides the fastest non-cache memory access for the node’s CPU. Multiple nodes are then combined together to form a NUMA machine. Memory transfers between nodes are handled by routers. Memory that can only be accessed via a router is called ‘remote’ memory.
NUMA is easy to be managed and expanded, but costs a lot of time to access remote memory. The main benefit of NUMA is scalability. The NUMA architecture was designed to surpass the scalability limits of the SMP architecture. With SMP, all memory access is posted to the same shared memory bus. This works fine for a relatively small number of CPUs, but not when you have dozens, even hundreds, of CPUs competing for access to the shared memory bus. NUMA alleviates these bottlenecks by limiting the number of CPUs on any one memory bus and connecting the various nodes by means of a high speed interconnection.
Cache-Only Memory Architectures (COMA)
In a cache-only memory architecture, memory organization is similar to that of the NUMA in that each processor holds a portion of the address space. However, the partitioning of data among the memories does not have to be static, since all distributed memories are organized like large (second level) caches. The task of such a memory is twofold. Besides being a large cache for the processor, it may also contain some data from the shared address space that the processor never accessed before- in other words, it is a cache and a virtual part of the shared memory. This is called attraction memory.
COMA increases the chances of data being available locally because the hardware transparently replicates the data and migrates it to the memory module of the node that is currently accessing it. Each memory module acts as a huge cache memory in which each block has a tag with the address and the state.
Figure 4: COMA- Cache-Only Memory Architectures
When a processor requests a block from a remote memory, the block is inserted in both the processor’s cache and the node’s AM. A block can be evicted from an AM if another block needs the space. Ideally, with this support, the processor dynamically attracts its working set into its local memory module. The data the processor is not accessing overflows and is sent to other memories. Because a large AM is more capable of containing the node’s current working set than a cache is, more of the cache misses are satisfied locally within the node.
Write Policies
- Write-back: A write changes data only in the cache block without being propagated to the lower level memory hierarchy . Only when the block is evicted, the block is “written back”, or propagated and overwrites the stale version at the lower level.
- Write-through: Every time the processor writes to a cached memory location, both the cache and all the underlying memory components are updated.
- Write-Allocate: On a write miss, the block is brought into the cache before it is written.
- Write-No Allocate: On a write miss, the write is propagated down to the lower level memory hiearchy component without bringing the block into the cache.
Replacement Policies
Cache replacement policy has by far the strongest influence on the predictability of the cache behavior (cache hit rate). Following are some of the replacement policies used in caches today:
Least Recently Used (LRU): Cache block that was accessed the farthest back in the past is selected to be evicted to make room for a new block. This policy exploits the temporal locality that data accessed recently is more likely to be accessed again.
First-In First-Out (FIFO): FIFO can be seen as a queue: in this policy new elements are inserted at the front evicting elements at the end of the queue. Disadvantage of this policy is that even if a block is accessed more number of times than another that was brought in later, it will be evicted.
Pseudo-LRU (PLRU): For caches with large associativity (generally >4 ways), the implementation cost of LRU becomes prohibitive. If a scheme that almost always discards one of the least recently used items is sufficient, the PLRU algorithm can be used which only needs one bit per cache item to work.
This is how it works:
two-way set associative - one bit- indicates which line of the two has been reference more recently
four-way set associative - three bits
Each bit represents one branch point in a binary decision tree; let 1 represent that the left side has been referenced more recently than theright side, and 0 vice-versa.When a line must be replaced, the cache selects which of L0:L1 and L2:L3 was least recently used. Then the cache determines which of the two lines was least recently used and marks it for replacement. This decision tree is shown in Figure .
Most-Recently Used (MRU): Evicting the most-recently used (MRU) page does very well on LRU’s worst case. In general, however, MRU is a bad idea, since many programs exhibit temporal locality in their memory accesses, and MRU is effectively assuming that memory accesses will not exhibit locality.
Least Frequently Used (LFU): LFU counts how often an item is needed. Those that are used least often are discarded first.
Random: To spread allocation uniformly, blocks to be evicted are randomly selected. Some systems generate pseudorandom block numbers to get reproducible behavior, which is useful when debugging hardware
Cache Coherence
In multiprocessing system, suppose a task running on processor P requests the data in global memory location X, the contents of X are copied to processor P’s local cache. Now, suppose processor Q also accesses X. What happens if Q wants to write a new value over the old value of X?
There are two fundamental cache coherence policies:
- Write-invalidate - On a write to a block by a cache all other cached copies are made invalid. Only when an invalid block is accessed, it is updated by bringing in the new block.
- Write-update- On a write to a block by any cache, the updated value is propagated to all other caches.
These two policies along with write back and write-through caches form many cache coherence protocols that control replacement and write policies in multiprocessor architectures. Some of the coherence protocols are:
- MSI protocol
- MESI protocol
- MOSI protocol
- MOESI protocol
- MERSI protocol
- MESIF protocol
- Write-once protocol
- Synapse protocol
- Berkeley protocol
- Illinois protocol
- Firefly protocol
- Dragon protocol
Choice of the consistency model is crucial to designing a cache coherent system. Coherence models differ in performance and scalability; each must be evaluated for every system design.
References
- http://software.intel.com/en-us/blogs/2009/03/11/learning-experience-of-numa-and-intels-next-generation-xeon-processor-i/
- Eric Hagersten, Anders Landin, and Seif Haridi, DDM- A Cache only Memory Architecture, SICS Research Report 1991
- Yan Solihin, Fundamentals of Computer Architecture.
- http://it.toolbox.com/wiki/index.php/NUMA_Architecture
- http://practical-tech.com/infrastructure/numa-theory-and-practice/
- Fredrik Dahlgren, Josep Torrellas: Research Feature- Cache Only Memory Architecture
- http://en.wikipedia.org/
- David E. Culler, Jaswinder Pal Singh, with Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, © 1999 Morgan-Kauffman, ISBN 1-55860-343-3