CSC/ECE 506 Spring 2012/7b pk
Revision as of 21:19, 18 March 2012
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing
Overview
Background - Virtual Memory, Paging and TLB
In this section we introduce the basic terminology and the set-up where TLBs are used.
A process running on a CPU has its own view of memory, called "virtual memory": a single contiguous address space that is mapped to actual physical memory. The virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, relying on hardware and the operating system to bring the missing items into main memory when needed.
Paging is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from backing storage (for example, a hard drive).
Each process has a page table that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to its page table (the Page-Table Base Register, or PTBR) is stored in a register. Switching page tables requires changing only this one register, substantially reducing context-switch time. (For example, a page table with 4-byte (32-bit) entries can name up to 2^32 physical page frames. If the frame size is 4 KB, such a system can address 2^44 bytes, or 16 TB, of physical memory.)
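The addressable-memory arithmetic in the parenthetical above can be checked directly. This is just a worked restatement of the numbers in the text (2^32 frames of 4 KB each):

```python
# 32-bit page-table entries can name up to 2**32 physical frames.
frames = 2 ** 32
frame_size = 4 * 1024                 # 4 KB frames, as in the text

physical_bytes = frames * frame_size  # 2^32 * 2^12 = 2^44 bytes
print(physical_bytes == 2 ** 44)      # True
print(physical_bytes // 2 ** 40)      # 16 (TB)
```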
The hardware support for paging is shown in the figure. Every address is divided into two parts: a page number (p) and a page offset (d). The page number is used as an index into the page table, which contains the base address of each page in physical memory. The base address is combined with the page offset to form the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location: we must first index into the page table (which requires a memory access via the PTBR) to find the frame number, and only then access the required frame. Memory access is therefore slowed down by a factor of 2.
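The p/d split described above can be sketched in a few lines. This is a minimal model, assuming 4 KB pages and a flat page table represented as a dictionary (the mapping used here is hypothetical):

```python
PAGE_SIZE = 4096      # 4 KB pages, so the offset field (d) is 12 bits wide
OFFSET_BITS = 12

def translate(virtual_addr, page_table):
    """Split a virtual address into page number (p) and offset (d),
    then combine the frame's base address with the offset."""
    p = virtual_addr >> OFFSET_BITS        # page number indexes the page table
    d = virtual_addr & (PAGE_SIZE - 1)     # page offset passes through unchanged
    frame = page_table[p]                  # the extra memory access a TLB avoids
    return (frame << OFFSET_BITS) | d

# Hypothetical mapping: virtual page 2 -> physical frame 7
page_table = {2: 7}
print(hex(translate(0x2ABC, page_table)))  # 0x7abc
```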
The standard solution to this problem is to use a special, small, fast hardware lookup cache called a Translation Look-Aside Buffer (TLB). The TLB is fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.
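The hit/miss/replacement flow above can be modeled with a small sketch. Real TLBs are fully associative hardware, not a Python dictionary; this model just illustrates the LRU policy mentioned in the text:

```python
from collections import OrderedDict

class TLB:
    """Tiny TLB model with LRU replacement (illustrative only)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()          # virtual page -> physical frame

    def lookup(self, page, page_table):
        if page in self.entries:              # TLB hit
            self.entries.move_to_end(page)    # mark as most recently used
            return self.entries[page], True
        frame = page_table[page]              # TLB miss: consult the page table
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        self.entries[page] = frame            # cache for the next reference
        return frame, False

page_table = {p: p + 100 for p in range(10)}  # hypothetical mapping
tlb = TLB(capacity=2)
print(tlb.lookup(1, page_table))   # (101, False) -> miss, entry is cached
print(tlb.lookup(1, page_table))   # (101, True)  -> hit on the second access
```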
A page table typically has an additional bit per entry to indicate whether the frame is read-only or read-write (the protection level). Another bit, valid-invalid, indicates whether the frame belongs to that particular process.
Multilevel page tables are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.
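The multilevel walk can be sketched for the classic two-level 32-bit split (10 index bits per level plus a 12-bit offset). The tables here are hypothetical dictionaries standing in for in-memory page tables:

```python
def walk(virtual_addr, top_table):
    """Two-level page-table walk: the first-level entry points to a
    second-level table, whose entry yields the physical frame."""
    p1 = (virtual_addr >> 22) & 0x3FF   # top 10 bits: first-level index
    p2 = (virtual_addr >> 12) & 0x3FF   # next 10 bits: second-level index
    offset = virtual_addr & 0xFFF       # low 12 bits: page offset
    second_level = top_table[p1]        # first memory access
    frame = second_level[p2]            # second memory access
    return (frame << 12) | offset

# Hypothetical tables: first-level entry 1 -> a second-level table
# whose entry 3 maps to physical frame 42.
top = {1: {3: 42}}
va = (1 << 22) | (3 << 12) | 0x5
print(hex(walk(va, top)))               # 0x2a005 (frame 42, offset 0x5)
```

Note that each extra level adds another memory access to a miss, which is exactly the cost the TLB amortizes.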
It may be noted that caches typically use a "Virtually Indexed, Physically Tagged" (VIPT) organization that enables simultaneous lookup of both the L1 cache and the TLB: the cache set is indexed with virtual address bits while the TLB translates the page number in parallel, and the resulting physical tag is used for the comparison. This simultaneous lookup makes the process efficient (Solihin).
TLB coherence problem in multiprocessing
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of the sharing of data across processors and the migration of processes from one processor to another.) Since these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is referred to as the TLB coherence problem.
A TLB entry changes because of the following events - a) A page is swapped in or out (because of context change caused by an interrupt), b) There is a TLB miss, c) A page is referenced by a process for the first time, d) A process terminates and TLB entries for it are no longer needed, e) A protection change, e.g., from read to read-write, f) Mapping changes.
Of these changes, a) swap-outs, e) protection changes, and f) mapping changes lead to the TLB coherence problem. A swap-out makes the corresponding virtual-to-physical page mapping no longer valid (indicated by the valid/invalid bit in the page table). If a TLB holds a mapping that falls within an invalidated Page Table Entry (PTE), it must be flushed from all TLBs and must not be used. Similarly, protection-level changes for a mapping need to be observed by all processors, and a mapping modification (where the physical mapping for a virtual address changes) also needs to be seen coherently by all TLBs.
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only), and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when that core attempts to execute a store instruction; an access violation will occur, and the page fault handler can load the updated translation.
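The safe/unsafe distinction above can be sketched as a small classifier. The dictionary layout and the helper name are illustrative, not taken from any real implementation:

```python
# Classify a page-table-entry change as "unsafe" (must be made visible to all
# TLBs immediately, e.g. via a coherence action) or "safe" (can be picked up
# lazily, via a later fault).  Field names here are hypothetical.

def is_unsafe(old, new):
    if old["valid"] and not new["valid"]:   # c) translation invalidated
        return True
    if old["frame"] != new["frame"]:        # a) mapping modification
        return True
    if old["perms"] - new["perms"]:         # b) privileges decreased
        return True
    return False                            # e.g. privilege increase or
                                            # accessed/dirty-bit updates

ro = {"valid": True, "frame": 7, "perms": {"r"}}
rw = {"valid": True, "frame": 7, "perms": {"r", "w"}}
print(is_unsafe(ro, rw))   # False: read-only -> read-write upgrade is safe
print(is_unsafe(rw, ro))   # True:  a downgrade must be propagated coherently
```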
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB and cache coherence protocol.
TLB coherence through Shootdown
TLB coherence through invalidation
Other TLB coherence solutions
Unified Cache and TLB coherence solution
In this section the salient features of UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu/Lebeck/Sorin/Bracy in their paper are discussed.
Synopsis
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. Under UNITD, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. This is a significant improvement over the software-based shootdown approach and substantially reduces the performance penalty of TLB coherence.
Background
Cache coherence protocols are all-hardware based because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based and not too costly for systems with a small number of processors. A hardware-based TLB coherence scheme can a) improve TLB coherence performance significantly and b) provide a cleaner interface to the operating system.
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the shootdown algorithm used, b) the number of processors, and c) the position of the TLB in the memory hierarchy (shared TLB, per-processor TLB, or placed in memory). Shootdown algorithms also trade performance for complexity.
The performance penalty for shootdown increases with the number of processors.
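The linear scaling above follows directly from the structure of a shootdown: the initiator must notify every other processor and collect an acknowledgement from each. A minimal sketch of that pattern (class and function names are illustrative; real systems use inter-processor interrupts, not method calls):

```python
class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.tlb = {}                 # virtual page -> physical frame

    def invalidate(self, page):
        self.tlb.pop(page, None)      # drop the stale translation, if cached

def shootdown(initiator, processors, page):
    initiator.invalidate(page)        # invalidate the local entry first
    acks = 0
    for cpu in processors:
        if cpu is not initiator:      # "IPI": every other CPU must respond
            cpu.invalidate(page)
            acks += 1                 # work grows linearly with CPU count
    return acks

cpus = [Processor(i) for i in range(4)]
for cpu in cpus:
    cpu.tlb[2] = 7                    # all CPUs cache the same mapping
print(shootdown(cpus[0], cpus, 2))    # 3 acknowledgements for 4 processors
```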
Links
1. Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley
4. Microprocessor Memory Management Units, Milan Milenkovic, IBM
References
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication
2. Fundamentals of Parallel Computer Architecture, Yan Solihin
Quiz
1. Page Table is maintained as a software construct
a) True b) False
2. TLB keeps a mapping of ___ to ____
a) Virtual Page to Physical Page b) Virtual Address to Physical Address c) Physical Page to Virtual Page d) Physical Address to Virtual Address
3. Currently TLB coherence is most often achieved through
a) Software solutions b) Hardware solutions
4. UNITD is -
a) A protocol for TLB coherence b) A protocol for cache coherence c) A protocol for cache and TLB coherence
5. Which of the following is not used for TLB coherence?
a) Virtual Cache address b) Shootdown c) Invalidation d) Hardware solutions e) MOESI