ECE506 CSC/ECE 506 Spring 2012/7b na

Introduction

Virtual addresses are translated to absolute addresses using a hardware structure called the Translation Lookaside Buffer (TLB). A TLB accepts a Virtual Page Number and returns the corresponding Physical Page Number. The translation lookaside buffer performs other functions in addition to the basic address translation. The other functions include access control, program debugging support and operating system support for virtual memory.

What is a TLB?

Since all references in Level 0 systems are absolute accesses, these systems do not have TLBs. A TLB is typically not large enough to hold all the current translations. Translations for all pages in memory are stored in a memory structure called the Page Table. The TLB is organized as two parts. The instruction TLB (ITLB) is only used for instruction references, while the data TLB (DTLB) is only used for data references. A system may implement a combined TLB which is used for both instruction and data references. Additionally, translations are supported for large address ranges. Such translations are called block translations and are stored in a block TLB. Block translations map address ranges larger than a page. Block translations are useful in mapping virtual address ranges which do not get paged in and out. These block translations increase the virtual address range of the TLB thereby minimizing the virtual address translation overhead.

TLB Implementation

A common approach to reducing the overhead of address translation in paging systems is to use the descriptor cache better known as the translation lookaside bufler. The TLB contains associative memory in which it stores recently used page descriptors. The TLB contains the corresponding (virtual address/PMT entry) pairs used by recent address translations. The address-translation hardware searches the TLB associatively by the page portion of the virtual address when the processor initiates a memory reference. When the match is found, the TLB furnishes the corresponding PMT entry. The physical address stored in the PMT entry forms the physical memory address. Protection is enforced by comparing task boundaries and access rights bits stored in the PMT entry with the obtained address and the intended mode of memory access. The MMU flags discrepancies as protection violations and aborts the memory reference. In addition, a processor exception is raised to allow the operating system to abort the offending task.

TLBs generally have high hit ratios, in excess of 90 percent, and thus complete most address translations quickly. When a virtual address is not found in the TLB (a miss), the memory management hardware has to reference the PMTs in memory to complete the translation. The root pointer to the address-translation tree of a specific task is provided by the operating system as a part of the task-control block. The root pointer provides the base address of the next-level mapping table. The MMU uses it to reference memory and to obtain the related PMT entry. The presence bit therein is inspected to establish whether the item that it points to is resident in (physical) memory. If not, the page-fault exception is raised, and the operating system fetches the missing item from secondary storage. When the target item is in main memory, the MMU checks whether it is the leaf node of the translation tree. If not, the process is repeated for each level of hierarchy until the last level is reached. Only the leaf level contains pointers to page frames in main memory. Other pointers are indirect and serve as trail markers within the address-translation tree. Once the last level of the mapping hierarchy is reached, the MMU stores the obtained PMT entry into the TLB for future reuse. If the TLB is full, it replaces an existing entry to make room for the incoming one.

TLBs commonly employ the least recently used (LRU) replacement policy. Simpler alternatives, such as FIFO and random, may be cheaper to implement but have inferior performance. Due to the non uniqueness of virtual addresses between different address spaces, single-task TLBs must be flushed upon every task switch. The oncoming task must usually resort to loading the TLB with its entries through a succession of TLB misses. Some designers opt to alleviate this problem by using multitasking or multicontext TLBs. A multicontext TLB removes ambiguities by tagging each entry with a unique owner identity, usually that of the enclosing task. On translation, both the address, and the owner identity must match for a given entry to produce a hit. In this way, the TLB need not be flushed upon task switches, and a re-scheduled task can benefit from finding some of its entries in the TLB. In principle, TLBs in systems with hierarchical mapping tables should provide comparatively higher hit ratios to offset the longer translation times incurred by TLB misses. Increasing the number of entries, which also tends to increase the price, generally accomplishes a higher hit ratio.

Here, X is the virtual page number to be translated into a physical page number, X' is the Physical number corresponding to X. and V is the Virtual page number of a victim page(that is a page to be kicked off the table).

TLB miss handling

Given a virtual address, the selected TLB is searched for an entry matching the Virtual Page Number. If the entry exists, the 20-bit Physical Page Number (contained in the TLB entry) is concatenated with the original 12-bit page offset to form a 32-bit absolute address. If no such entry exists, the TLB is updated by either software TLB miss handling or hardware TLB miss handling. In systems with software TLB miss handling, a TLB miss fault interruption routine performs the translation, explicitly inserts the translation and protection fields into the appropriate instruction TLB or data TLB, and restarts the interrupted instruction. To insure the completion of instructions, the TLB must be organized to simultaneously hold all necessary translations. In implementations that provide hardware for TLB miss handling, the hardware attempts to find the virtual to physical page translation in the Page Table. If the hardware is successful, it inserts the translation and protection fields into the appropriate instruction or data TLB. No interruption occurs in this case. If hardware is not successful, due to a search of the Page Table that was not exhaustive or due to the appropriate translation not existing in the Page Table, an interruption occurs so that the software can complete the process.

TLB coherence

The cache-coherent part of a multiprocessor system is required to behave as if there were logically a single DTLB and a single ITLB. All TLBs in a multiprocessor system are required to broadcast purges, except PDTLBE and PITLBE, to all other TLBs. The originating processor's purge instruction suspends until all target processors complete the purge. A processor's TLB is simply a cache on the page table entries (PTEs) used for virtual to physical address translation. A PTE can come to reside in the TLBs in multiple processors due to either actual sharing of data or process migration. PTEs may be modified- for example, when the page is swapped out or its protection is changed - leading to direct analog of the cache coherence problem. A variety of solutions have been used for TLB coherence. Software solutions, through the operating system, are popular since TLB coherence operations are much less frequent than cache coherence operations. The exact solutions used depend on whether PTEs are loaded into the TLB directly by hardware or under software control as well as on several other variables in how TLBs and operating systems are implemented. Hardware solutions are also used by some systems, particularly when TLB operations are not visible to the software. The next section provides a brief overview of four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers (ASIDs), and hardware TLB coherence. Further details can be found in (Thompson et al. 1988; Rosenburg 1989; Teller 1990) and the papers referenced therein.

Virtual Memory in modern computer systems

Approaches to avoid TLB coherence problem

Virtually addressed caches

TLBs, and hence the TLB coherence problem, can be avoided entirely by using virtually addressed caches. Address translation is now needed only on cache misses, so particularly if the cache miss rate is small, we can use the page tables directly. PTEs are brought into the regular data cache when they are accessed and are therefore kept coherent by the cache coherence mechanism. However, when a physical page is swapped out of memory or its protection changed, this is not visible to the cache coherence hardware, so the PTE must be flushed from the virtually addressed caches of all processors by the operating system. In addition, the coherence problem for virtually addressed caches must be solved. This approach was explored in the SPUR research project (Hill et al. 1986; WOOD et al. 1986)

TLB shootdown

A second approach is called TLB shootdown. There are many variants that rely on different (but small) amounts of hardware support, usually including support for interprocessor interrupts and invalidation of individual TLB entries. The TLB coherence procedure is invoked by a processor, called the initiator, when it makes changes to PTEs that may be cached by other TLBs. Since changes to PTEs must be made by the operating system, the OS knows which PTEs are being changed and it may also know which other processors might be caching then in their TLBs ( conservatively, since entries may have been replaced). The OS kernel locks the PTEs being changed (or the relevant page table sections, depending on the granularity of locking) and sends interrupts to other processors that it thinks have copies. On being interrupted, the recipients disable interrupts, look at the list of page table entries being modified (which is in shared memory), and locally invalidate those entries from their TLBs. The initiator waits for them to finish, perhaps by polling shared memory locations, and then unlocks the page table sections. A different, somewhat more complex shootdown algorithm is used in the MACH operating system (Black et al. 1989). <ref>http://stackoverflow.com/questions/3748384/what-is-tlb-shootdown</ref>.

Address Space Identifier (ASID)

Some processor families, most notably the MIPS family from Silicon Graphics, use software-loaded rather than hardware-loaded TLBs, which means that the OS is involved not only in PTE modifications but also in loading a PTE into the TLB on a miss. In these cases, the coherence problem for process-private pages due to process migration can be solved using a third approach, that of ASIDs, which avoids interrupts and TLB shootdown. Every TLB entry has an ASID field associated with it to avoid flushing the entire TLB on a context switch(just as the process identifier is used in virtually addressed caches). In the case of the TLBs, however, ASIDs are like tags allocated dynamically by the OS on a per-processor basis, using a free pool to which they are returned when TLB entries are replaced; they are not associated with processes for their lifetime. One way to use the ASIDs, as was done in the IRIX 5.2 operating system, follows. The OS maintains an array for each process that tracks the ASID assigned to that process on each of the processors in the system. When a process modifies a PTE, the ASID of that process for all other processors is set to zero. This ensures that when the process is migrated to another processor, it will find its ASID to be zero there; the kernel will therefore allocate it a new one, thus preventing use of stale TLB entries. TLB coherence for pages truly shared by the processes is performed using TLB shootdown.

Figure shows the organization of virtually addressed caches.

Organization of Virtual Addressed Caches<ref>lec15 from NCSU ECE506</ref>

From the figure “ASID” means “address-space identifier,” which is often called a process identifier (PID) in some books. The diagram shows the “V-index” overlapping the (virtual) page number. If not, we wouldn’t need a virtually addressed cache. We could do address translation in parallel with cache access with a physically addressed cache. Also the V-index actually does not stored in the cache. Like the index (“set” or “line”) field in physically addressed caches, it is not stored, but just tells what line or set to look in for the data<ref>lec15 from NCSU ECE506</ref>.

Invalidate technique <ref>lec15 from NCSU ECE506</ref>

Finally, some processor families provide hardware instructions to invalidate other processors' TLBs. In the PowerPC family (Weiss and Smith 1994), the "TLB invalidate entry" instruction (tlbie) broadcasts the page address on the bus so that the snooping hardware on other processors can automatically invalidate the corresponding TLB entries without interrupting the processor. The algorithm for handling changes to PTEs is simple: the operating system first makes changes to the page table and then issues a tlbie instruction for the changed PTEs. If the TLB is hardware loaded (as it is in the PowerPC), the OS does not know which other TLBs might be caching the PTE, so the invalidation must be broadcast to all processors. Broadcast is well suited to a bus but undesirable for the more scalable systems with distributed networks.

Multiprocessor Considerations

In multiprocessor systems with shared virtual memory and multiple MMUs, a mapping descriptor can be simultaneously cached in multiple TLBs. Since the contents of a mapping descriptor can vary in response to changes of state in the related page in memory, multiple copies of a given descriptor may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as TLB coherence. In a milder form, the coherence problem exists even in uniprocessor systems in which the TLB and memory copies of a mapping descriptor should be identical. Moreover, the updating of a memory-based descriptor should be an atomic action that makes the descriptor consistent with itself. The TLB coherence problem shares many characteristics with its better known cache-coherence counterpart. As with caches, a crude way to deal with TLB coherence is to disallow TLB buffering of shareable descriptors. Forcing all mappings to complete by using the in-memory copy of a shared descriptor avoids the coherence problem at the cost of increased mapping delays.

Numerous other ways to deal with the problem require some sort of hardware assistance such as snooping TLBs, remote TLB entry invalidation, and selective disabling or flushing of TLB entries. In systems that have both caches and TLBs, the two coherence problems are interdependent in perhaps non-obvious ways. For example, disallowing placement of shareable entries into TLBs may not achieve TLB coherence if caching of the mapping descriptors can occur and cache coherence is not enforced. Consistency and coherence of shared entries in combined MMU/cache/memory systems still pose an open research problem. If the past is any indicator, some time will elapse before a few dominant solutions emerge and find their way into commercial implementations of cache and MMU units. In the meantime, designers of multi microprocessor systems will have to resort to custom solutions.

References

ECE506 CSC/ECE 506 Spring 2012/7b na

Contents

Introduction

What is a TLB?

TLB Implementation

TLB miss handling

TLB coherence

Approaches to avoid TLB coherence problem

Virtually addressed caches

TLB shootdown

Address Space Identifier (ASID)

Invalidate technique <ref>lec15 from NCSU ECE506</ref>

Multiprocessor Considerations

References

Navigation menu

ECE506 CSC/ECE 506 Spring 2012/7b na

Introduction

What is a TLB?

TLB Implementation

TLB miss handling

TLB coherence

Approaches to avoid TLB coherence problem

Virtually addressed caches

TLB shootdown

Address Space Identifier (ASID)

Invalidate technique <ref>lec15 from NCSU ECE506</ref>

Multiprocessor Considerations

References

Navigation menu

Search