<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Psamoue</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Psamoue"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Psamoue"/>
	<updated>2026-06-26T16:03:31Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60596</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60596"/>
		<updated>2012-03-26T22:12:10Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessor architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that holds the virtual-to-physical mappings of the most recently used pages. The CPU uses the TLB for fast look-up of the physical page frame number for a virtual memory location. The same look-up through the page table of the process (a software-based construct) is slower. A mapping may become invalid when its page is swapped out to backing storage, or its protection level may change (read-only versus read-write). Keeping the TLB entries for the same page consistent across processors as its state changes is called the TLB coherence problem. This coherence is most often maintained by software-based solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence, and c) a proposal for a unified, hardware-based cache-TLB coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (one single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical frames. A page table is a software construct, managed by the operating system, that is kept in main memory. For a process, a pointer to the page table is stored in a register, the '''Page-Table Base Register (PTBR)'''. Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), then a system with 4-byte entries can address 2^32 x 2^12 = 2^44 bytes (16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (located through the PTBR, which itself requires a memory access) to find the frame number, and only then access the desired location in that frame. Two memory accesses are thus needed for each reference, so memory access is slowed down by a factor of 2.&lt;br /&gt;
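&lt;br /&gt;
The address split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: 4 KB pages and a page table modeled as a plain dictionary. The names PAGE_SIZE, page_table and translate are hypothetical and do not come from any real system.&lt;br /&gt;

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages, so the offset is the low 12 bits

# Hypothetical page table: virtual page number -> physical frame number.
page_table = {0: 5, 1: 9, 2: 3}

def translate(virtual_address):
    """Split the address into (p, d) and rebuild the physical address."""
    p, d = divmod(virtual_address, PAGE_SIZE)  # page number, page offset
    frame = page_table[p]                      # the extra memory access
    return frame * PAGE_SIZE + d               # frame base plus offset

physical = translate(1 * PAGE_SIZE + 100)
```

The dictionary access stands in for the additional memory reference that a page-table look-up costs on real hardware.&lt;br /&gt;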
&lt;br /&gt;
[[File:TLB.png‎|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast look-up hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found its frame number is immediately available and is used to access memory. &lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
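&lt;br /&gt;
The hit, miss and replacement behaviour described above can be sketched as follows. This is a minimal software model, assuming LRU replacement and a page table given as a dictionary; the class name TLB and its fields are hypothetical, and a real TLB performs the comparison in hardware.&lt;br /&gt;

```python
from collections import OrderedDict

class TLB:
    """Tiny fully associative TLB with LRU replacement (illustrative only)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table      # page number -> frame number
        self.entries = OrderedDict()      # least recently used entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)    # mark as most recently used
            return self.entries[page]
        self.misses += 1
        frame = self.page_table[page]         # slow path: consult the page table
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        self.entries[page] = frame
        return frame

tlb = TLB(capacity=2, page_table={0: 5, 1: 9, 2: 3})
frames = [tlb.lookup(p) for p in (0, 1, 0, 2, 1)]
```

With a capacity of two entries, the reference to page 2 evicts page 1 (the least recently used), so the final reference to page 1 misses again.&lt;br /&gt;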
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that L1 caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; scheme that allows the L1 cache and the TLB to be looked up simultaneously. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch the block. The simultaneous look-up of both the L1 cache and the TLB makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping whose Page Table Entry (PTE) has been invalidated, that mapping must be flushed from all TLBs and must not be used. Further, protection level changes for a TLB mapping need to be observed by all processors. A mapping modification for a TLB entry (where the physical mapping changes for a virtual address) also needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
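&lt;br /&gt;
The distinction between safe and unsafe changes can be expressed as a small predicate. The sketch below is an assumption-laden illustration: a translation is reduced to a frame number, a writable flag and a valid flag, and the names Translation and needs_tlb_coherence are hypothetical.&lt;br /&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Translation:
    """Simplified page-table entry (hypothetical fields)."""
    frame: int
    writable: bool
    valid: bool = True

def needs_tlb_coherence(old, new):
    """Return True when the change old -> new is unsafe for stale TLB copies."""
    if old.frame != new.frame:
        return True      # a) mapping modification
    if old.writable and not new.writable:
        return True      # b) privileges decreased
    if old.valid and not new.valid:
        return True      # c) translation marked invalid
    return False         # e.g. privileges increased: fixed lazily on a fault

read_only = Translation(frame=7, writable=False)
read_write = Translation(frame=7, writable=True)
upgrade_is_unsafe = needs_tlb_coherence(read_only, read_write)
downgrade_is_unsafe = needs_tlb_coherence(read_write, read_only)
```

Here the upgrade from read-only to read-write is classified as safe, matching the lazy-update scenario in the paragraph above, while the downgrade is classified as unsafe.&lt;br /&gt;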
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
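&lt;br /&gt;
The six steps above can be walked through in a sequential sketch. This is not a real kernel implementation: interrupts, locks and busy-waiting are only recorded in a log, and the names Processor, handle_shootdown_ipi and shootdown are hypothetical. It also assumes that victims refill their TLBs lazily on the next miss.&lt;br /&gt;

```python
class Processor:
    def __init__(self, name, tlb):
        self.name = name
        self.tlb = dict(tlb)        # page -> frame, a private TLB copy
        self.tlb_active = True

    def handle_shootdown_ipi(self, page):
        """Step 4: invalidate the entry, clear the flag, then wait."""
        self.tlb.pop(page, None)
        self.tlb_active = False

def shootdown(initiator, others, page_table, page, new_frame):
    log = []
    initiator.tlb_active = False            # step 1: stop using the own TLB
    log.append("lock page table")           # step 2: lock (modeled as a log entry)
    for cpu in others:                      # step 3: interrupt every victim
        cpu.handle_shootdown_ipi(page)
        log.append("ipi " + cpu.name)
    page_table[page] = new_frame            # step 5: update PMT and own TLB
    initiator.tlb[page] = new_frame
    initiator.tlb_active = True
    log.append("unlock page table")
    for cpu in others:                      # step 6: victims leave the wait state
        cpu.tlb_active = True
    return log

page_table = {7: 40}
cpu0 = Processor("cpu0", page_table)
cpu1 = Processor("cpu1", page_table)
log = shootdown(cpu0, [cpu1], page_table, page=7, new_frame=41)
```

After the call, the initiator holds the new mapping and the victim holds no entry for the page, so its next access misses and reloads from the updated page table.&lt;br /&gt;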
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors assigns each TLB entry a unique identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so the stale entries are never used and the translation is reloaded from the page table.&lt;br /&gt;
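&lt;br /&gt;
The ASID mechanism can be illustrated with a small model. It assumes one ASID array per processor and TLB entries tagged with an (ASID, page) pair; ASID wrap-around is ignored, and the initiating processor simply drops its own entry, which is an assumption of this sketch. The names Cpu, asid_for and modify_pte are hypothetical.&lt;br /&gt;

```python
class Cpu:
    def __init__(self):
        self.asid_of = {}     # pid -> ASID on this processor (0 means stale)
        self.next_asid = 1    # 0 is reserved as the marker for a stale ASID
        self.tlb = {}         # (asid, page) -> frame

    def asid_for(self, pid):
        """Return the ASID of the process, assigning a fresh one if zeroed."""
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = self.next_asid
            self.next_asid += 1
        return self.asid_of[pid]

    def lookup(self, pid, page, page_table):
        key = (self.asid_for(pid), page)
        if key not in self.tlb:           # miss: reload from the page table
            self.tlb[key] = page_table[page]
        return self.tlb[key]

def modify_pte(cpus, initiator, pid, page_table, page, frame):
    page_table[page] = frame
    # Assumption: the initiator invalidates its own entry directly.
    initiator.tlb.pop((initiator.asid_for(pid), page), None)
    for cpu in cpus:
        if cpu is not initiator:
            cpu.asid_of[pid] = 0          # TLB entries themselves are untouched

table = {3: 10}
a, b = Cpu(), Cpu()
before = b.lookup(1, 3, table)
modify_pte([a, b], a, 1, table, 3, 11)
after = b.lookup(1, 3, table)
```

The stale entry stays in the TLB of the second processor under the old ASID, but it can no longer match because the process now runs there under a new ASID.&lt;br /&gt;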
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are also used on AMD64 processors, for example by the Xen hypervisor with AMD-V. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
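&lt;br /&gt;
A minimal sketch of the generation-count check follows. It assumes the memory side keeps one counter per frame and that a remapping bumps the counters of both the old and the new frame; the names Memory, Cpu, remap and bump are hypothetical and not taken from the cited paper.&lt;br /&gt;

```python
class Memory:
    def __init__(self):
        self.page_table = {}    # page -> frame
        self.generation = {}    # frame -> generation count

    def bump(self, frame):
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def remap(self, page, frame):
        old = self.page_table.get(page)
        if old is not None:
            self.bump(old)      # stale TLB copies of the old frame now fail
        self.page_table[page] = frame
        self.bump(frame)

    def access(self, frame, generation):
        """Accept the request only if the generation count is current."""
        return self.generation[frame] == generation

class Cpu:
    def __init__(self, memory):
        self.memory = memory
        self.tlb = {}           # page -> (frame, generation at fill time)
        self.refills = 0

    def fill(self, page):
        frame = self.memory.page_table[page]
        self.tlb[page] = (frame, self.memory.generation[frame])
        self.refills += 1

    def read(self, page):
        if page not in self.tlb:
            self.fill(page)
        frame, gen = self.tlb[page]
        if not self.memory.access(frame, gen):   # denied: refill and retry
            self.fill(page)
            frame, gen = self.tlb[page]
        return frame

mem = Memory()
mem.remap(0, 5)
cpu = Cpu(mem)
first = cpu.read(0)
mem.remap(0, 6)
second = cpu.read(0)
```

No interrupt is sent when the mapping changes; the stale entry is detected only when the processor next uses it.&lt;br /&gt;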
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section we discuss the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy.&amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt; It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without needing any change to existing cache coherence protocol. This protocol is an improvement over the software based shootdown approach for TLB coherence and reduces the performance penalty due to TLB coherence significantly.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from Figure 3, which shows the latency for the processor issuing the shootdown and for the ones receiving it. The effective latency is higher than shown in the graphs because TLB invalidations result in extra cycles spent repopulating the TLB, either from the cache or from main memory depending on the case. The overhead from TLB shootdowns is higher for applications that trigger them more often. (Refer to Figure 4.)&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual-to-physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are therefore only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
UNITD tracks which PTEs are cached in a TLB using the PCAM (PTE CAM), a content-addressable memory that holds, for each TLB entry, the physical address of the PTE from which the translation was loaded. In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, the correspondence is relaxed: each translation is associated with the memory block containing the PTE, rather than with the PTE itself. Maintaining translation coherence at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same datum. For simplicity, we refer to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
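&lt;br /&gt;
The two PCAM operations described above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design from the UNITD paper: the class name TlbWithPcam, its method names and the sample page, frame and block numbers are all assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
class TlbWithPcam:
    """Toy TLB whose entries are mirrored, index for index, in a PCAM."""

    def __init__(self, num_entries):
        # tlb[i] holds (virtual_page, frame) or None; valid[i] is the Valid bit.
        self.tlb = [None] * num_entries
        self.valid = [False] * num_entries
        # pcam[i] holds the physical address of the cache block containing the PTE.
        self.pcam = [None] * num_entries

    def insert(self, index, virtual_page, frame, pte_block_addr):
        # The TLB and the PCAM are filled at the same index, at the same time.
        self.tlb[index] = (virtual_page, frame)
        self.valid[index] = True  # Shared state
        self.pcam[index] = pte_block_addr

    def evict(self, index):
        # Replacing a translation also removes its PCAM entry.
        self.tlb[index] = None
        self.valid[index] = False
        self.pcam[index] = None

    def coherence_invalidate(self, block_addr):
        # Associative search: every PCAM entry matching the block address hits.
        hits = [i for i, addr in enumerate(self.pcam) if addr == block_addr]
        for i in hits:
            self.valid[i] = False  # Invalid state
            self.pcam[i] = None
        return hits

    def lookup(self, virtual_page):
        # Regular lookups never consult the PCAM.
        for i, entry in enumerate(self.tlb):
            if self.valid[i] and entry is not None and entry[0] == virtual_page:
                return entry[1]
        return None  # TLB miss


tlb = TlbWithPcam(4)
tlb.insert(0, virtual_page=7, frame=40, pte_block_addr=12)
tlb.insert(1, virtual_page=8, frame=41, pte_block_addr=12)
tlb.insert(2, virtual_page=9, frame=42, pte_block_addr=20)
hits = tlb.coherence_invalidate(12)  # hits two entries, as in Figure 2(b)
```
&lt;br /&gt;
A single invalidation of block address 12 clears both translations whose PTEs share that cache block, while the unrelated translation at index 2 stays usable.&lt;br /&gt;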
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB; this validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
Three results stand out. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graph below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== Definitions ==&lt;br /&gt;
&lt;br /&gt;
;PMT (Page Mapping Table)&lt;br /&gt;
: This is an in-memory data structure that primarily helps the operating system map two address spaces to one another: virtual memory and physical memory. In addition to this mapping information, the PMT stores data fields that track whether a virtual page has been loaded into physical memory, its permissions and utilization metrics. Coupled with the TLB, the PMT is a critical component of virtual memory systems. Learn more about page tables [http://en.wikipedia.org/wiki/Page_table here].&lt;br /&gt;
&lt;br /&gt;
;PTE (Page Table Entry)&lt;br /&gt;
: A row in a page mapping table that relates a single page in virtual memory to a physical memory address or a page frame.&lt;br /&gt;
&lt;br /&gt;
;TLB (Translation Look-Aside Buffer)&lt;br /&gt;
: Associative, high-speed memory used to cache page table information and speed up virtual memory references from the processor.&lt;br /&gt;
&lt;br /&gt;
;Presence Bit&lt;br /&gt;
: One of the data elements in each PMT entry. Indicates whether the page is loaded in main memory (versus stored to disk).&lt;br /&gt;
&lt;br /&gt;
;Safe Change&lt;br /&gt;
: A type of update that can be made to a PMT safely without also updating the cached TLB counterpart.&lt;br /&gt;
&lt;br /&gt;
;TLB Coherence Strategy&lt;br /&gt;
: A process or system for maintaining the consistency of information between the processor TLBs and the PMT entries.&lt;br /&gt;
&lt;br /&gt;
;IDT (Interrupt Descriptor Table)&lt;br /&gt;
: A table that associates each interrupt vector with the address of its handler routine. During an inter-processor interrupt, the receiving processor uses it to locate the handler, such as the shootdown program.&lt;br /&gt;
&lt;br /&gt;
;Interrupt&lt;br /&gt;
: A hardware or software based method for signalling to a processor that there is work for it.&lt;br /&gt;
&lt;br /&gt;
;ASID (Address Space Identifier)&lt;br /&gt;
: A number assigned to a process that is unique within the scope of a processor for a specified time period. A process may be assigned multiple ASIDs across time. Used with the TLB entries, the ASID helps the processor discard stale TLB entries.&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
&lt;br /&gt;
[http://sequoia.ict.pwr.wroc.pl/~iro/RISC/sm/www.hp.com/acd-18.html#HEADING18-0 Address Resolution and the TLB]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication, 2004&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin, 2008-2009&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol10, Issue-2, 1990, Pages 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct (choose one answer)&lt;br /&gt;
::a) True&lt;br /&gt;
::b) False&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
::a) Virtual Page to Physical Page&lt;br /&gt;
::b) Virtual Address to Physical Address&lt;br /&gt;
::c) Physical Page to Virtual Page&lt;br /&gt;
::d) Physical Address to Virtual Address&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
::a) Software solutions&lt;br /&gt;
::b) Hardware solutions&lt;br /&gt;
4. UNITD is (choose one answer)&lt;br /&gt;
::a) A protocol for TLB coherence only&lt;br /&gt;
::b) A protocol for cache coherence only&lt;br /&gt;
::c) A protocol for cache and TLB coherence&lt;br /&gt;
5. Which of the following approaches provides a solution for TLB coherence (choose at least one)&lt;br /&gt;
::a) Virtual Cache address&lt;br /&gt;
::b) Shootdown&lt;br /&gt;
::c) Invalidation&lt;br /&gt;
::d) Hardware solutions&lt;br /&gt;
6. What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
::a) TLB active flag&lt;br /&gt;
::b) Hardware instructions&lt;br /&gt;
::c) Inter-Processor Interrupts&lt;br /&gt;
::d) PMT entries in main memory&lt;br /&gt;
7. Which changes to a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
::a) Increase in protection level&lt;br /&gt;
::b) Changes to virtual-physical address mappings&lt;br /&gt;
::c) Page table reference counts&lt;br /&gt;
::d) Presence bit&lt;br /&gt;
8. An ASID refers to: (choose one answer)&lt;br /&gt;
::a) The unique ID of a process for its lifetime&lt;br /&gt;
::b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
::c) The unique ID of a PMT entry&lt;br /&gt;
::d) The unique ID of a page&lt;br /&gt;
9. If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
::a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
::b) Between PMT entries in main memory and processor registers&lt;br /&gt;
::c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
::d) Between PMT entries in main memory and the data cache&lt;br /&gt;
10. How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
::a) Via a shared memory semaphore&lt;br /&gt;
::b) Via the interrupt IDT&lt;br /&gt;
::c) Via a hardware instruction&lt;br /&gt;
::d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60595</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60595"/>
		<updated>2012-03-26T22:04:15Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Quiz */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessing architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that holds the virtual-to-physical mappings of the most recently used pages. The CPU uses the TLB for fast lookup of the physical page frame number for a virtual memory location; the same lookup in the process's page table (a software-based construct) is slower. A page may become invalid because it is swapped out to backing storage, or its protection level may change (read-only versus read-write). Maintaining coherence between the TLB entries on different processors for the same page as its state changes is called the TLB coherence problem. This coherence is most often maintained by software solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified, hardware-based cache-TLB coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct, managed by the operating system, that is kept in main memory. For each process, a pointer to the page table is stored in a register, the '''Page-Table Base Register''' (PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
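&lt;br /&gt;
The arithmetic in the parenthetical above can be checked directly. The snippet below is only a worked calculation; the constant names are chosen for this example.&lt;br /&gt;
&lt;br /&gt;
```python
ENTRY_BITS = 32        # a 4-byte page-table entry
FRAME_SIZE = 4 * 1024  # 4 KB frames, i.e. 2^12 bytes

num_frames = 2 ** ENTRY_BITS                      # frames one entry can name
addressable_bytes = num_frames * FRAME_SIZE       # 2^32 * 2^12 bytes
addressable_tb = addressable_bytes // (2 ** 40)   # expressed in terabytes

print(addressable_bytes == 2 ** 44, addressable_tb)
```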
&lt;br /&gt;
Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (a memory access located through the PTBR) to find the frame number, and only then access the desired location in that frame. Two memory accesses are needed for every reference, so memory access is slowed down by a factor of 2.&lt;br /&gt;
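&lt;br /&gt;
As a minimal sketch of the translation just described, the following Python splits an address into a page number and an offset and then consults a page table. The 4 KB page size, the function names and the small page table are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
PAGE_SIZE = 4096  # assumed 4 KB pages


def split_address(virtual_address):
    # divmod yields (page number p, page offset d).
    return divmod(virtual_address, PAGE_SIZE)


def translate(virtual_address, page_table):
    page_number, offset = split_address(virtual_address)
    frame_number = page_table[page_number]    # first memory access: the page table
    return frame_number * PAGE_SIZE + offset  # address used for the second access


page_table = {0: 5, 1: 9, 2: 3}  # hypothetical page-to-frame mapping
physical = translate(4100, page_table)
```
&lt;br /&gt;
Address 4100 lies 4 bytes into page 1, which this table maps to frame 9, so the physical address is 9 * 4096 + 4.&lt;br /&gt;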
&lt;br /&gt;
[[File:TLB.png‎|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
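&lt;br /&gt;
The hit and miss paths described above can be modelled with a small software TLB. This is a sketch only: the class name LruTlb, the capacity of two entries and the sample page table are assumptions, and LRU is just one of the replacement policies mentioned.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class LruTlb:
    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table
        self.entries = OrderedDict()  # page number to frame number, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number):
        if page_number in self.entries:
            self.hits += 1
            self.entries.move_to_end(page_number)  # mark as most recently used
            return self.entries[page_number]
        self.misses += 1
        frame_number = self.page_table[page_number]  # slow path: read the page table
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        self.entries[page_number] = frame_number
        return frame_number


tlb = LruTlb(capacity=2, page_table={0: 5, 1: 9, 2: 3})
frames = [tlb.lookup(p) for p in (0, 1, 0, 2, 1)]
```
&lt;br /&gt;
With this reference string only the third lookup hits; the fourth lookup evicts page 1, the least recently used entry, so the fifth lookup misses again.&lt;br /&gt;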
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the page is read-only or read-write (the '''protection level'''). Another bit, '''valid-invalid''', indicates whether the page is part of that particular process's address space.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; scheme that enables simultaneous look-up of both the L1 cache and the TLB, which makes the process efficient. If the block is not found in the L1 cache, the address obtained from the TLB look-up is used to fetch it.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because data is shared across processors and processes migrate from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection-level changes to a TLB mapping must be observed by all processors. Finally, a mapping modification for a TLB entry (where the physical mapping changes for a virtual address) must also be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not maintain TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
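&lt;br /&gt;
The safe/unsafe distinction above can be expressed as a small predicate. The sketch below is illustrative only: the Translation record and its three fields are assumptions that cover just the changes named in the paragraph, and accessed/dirty bits are left out because updating them is always safe.&lt;br /&gt;
&lt;br /&gt;
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    frame: int
    writable: bool
    valid: bool


def is_unsafe_change(old, new):
    """Return True when cached copies of `old` must be invalidated."""
    if old.frame != new.frame:
        return True  # a) mapping modification
    if old.writable and not new.writable:
        return True  # b) privileges decreased
    if old.valid and not new.valid:
        return True  # c) translation marked invalid
    return False     # e.g. d) privileges increased


read_only = Translation(frame=7, writable=False, valid=True)
read_write = Translation(frame=7, writable=True, valid=True)
```
&lt;br /&gt;
Going from read_only to read_write is safe because a stale read-only copy can only cause a fault that the handler repairs, whereas the reverse change would let a core keep writing through a stale read-write copy.&lt;br /&gt;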
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. Its variants differ in two main respects: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
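&lt;br /&gt;
The six steps above can be traced with a simplified, single-threaded model. This is a sketch under strong assumptions rather than a kernel implementation: interrupts, the page-table lock and the wait state are represented only by log entries and the tlb_active flag, recipients reload their translation lazily on the next miss instead of updating it eagerly, and the names Cpu and shootdown are invented for the example.&lt;br /&gt;
&lt;br /&gt;
```python
class Cpu:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}           # virtual page to frame
        self.tlb_active = True  # the "active flag" from the steps above


def shootdown(initiator, others, page_table, page, new_frame):
    log = []
    # Steps 1-2: the initiator clears its active flag and locks the page table.
    initiator.tlb_active = False
    log.append("lock")
    # Steps 3-4: each recipient services the IPI, invalidates the entry and waits.
    for cpu in others:
        cpu.tlb.pop(page, None)
        cpu.tlb_active = False
        log.append("ack %d" % cpu.cpu_id)
    # Step 5: all acknowledgements are in, so the PMT can be changed safely.
    page_table[page] = new_frame
    initiator.tlb[page] = new_frame
    initiator.tlb_active = True
    log.append("unlock")
    # Step 6: recipients leave the wait state and resume using their TLBs.
    for cpu in others:
        cpu.tlb_active = True
    return log


page_table = {3: 10}
cpus = [Cpu(i) for i in range(3)]
for cpu in cpus:
    cpu.tlb[3] = 10
log = shootdown(cpus[0], cpus[1:], page_table, page=3, new_frame=11)
```
&lt;br /&gt;
The log makes the ordering constraint visible: the page table is modified only after every recipient has acknowledged, which is what makes the shootdown expensive as the number of processors grows.&lt;br /&gt;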
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processor architectures include a TLB invalidate instruction in their instruction set. The operating system kernel invokes the instruction to broadcast the modified page address from the initiating processor. The request is placed on the shared bus, and a snooper on each processor detects it and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that it is not guaranteed to stay the same for the life of a process. Each process is assigned a unique ASID on each processor; hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns the process a new ASID. The new ASID does not match the ASIDs of the stale TLB entries on that processor, so lookups miss and the stale entries are effectively invalidated.&lt;br /&gt;
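&lt;br /&gt;
A minimal sketch of this scheme follows. It is not the IRIX or MIPS implementation: the class AsidCpu, the function pte_modified and the sample numbers are assumptions for illustration, and the model ignores the finite size of a real ASID field, which forces periodic recycling.&lt;br /&gt;
&lt;br /&gt;
```python
class AsidCpu:
    def __init__(self):
        self.next_asid = 1
        self.asid_of = {}  # process id to current ASID on this CPU (0 means stale)
        self.tlb = {}      # (asid, virtual page) to frame

    def asid_for(self, pid):
        # A missing or zero ASID means the process needs a fresh identifier.
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = self.next_asid
            self.next_asid += 1
        return self.asid_of[pid]

    def lookup(self, pid, page, page_table):
        key = (self.asid_for(pid), page)
        if key not in self.tlb:
            self.tlb[key] = page_table[page]  # miss: reload from the page table
        return self.tlb[key]


def pte_modified(pid, other_cpus):
    # Only the pid-to-ASID array is touched; TLB entries keep their old tags.
    for cpu in other_cpus:
        cpu.asid_of[pid] = 0


cpu = AsidCpu()
page_table = {4: 100}
before = cpu.lookup(pid=1, page=4, page_table=page_table)
page_table[4] = 200  # the PTE is changed on another processor
pte_modified(1, [cpu])
after = cpu.lookup(pid=1, page=4, page_table=page_table)
```
&lt;br /&gt;
The stale entry tagged with the old ASID is still physically present in the TLB after the change, but it can no longer match, so the second lookup reloads the updated translation.&lt;br /&gt;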
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are also used to tag TLB entries on AMD64 processors with AMD-V, for example by the Xen hypervisor. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
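&lt;br /&gt;
A minimal Python sketch of the generation-count check follows. It is an illustration under stated assumptions rather than the scheme from the cited paper: the classes Memory and Tlb, the refills counter and the choice to bump the count on every mapping change are all hypothetical.&lt;br /&gt;

```python
# Hypothetical sketch of validation-based TLB consistency; names are illustrative.

class Memory:
    def __init__(self):
        self.page_table = {}   # virtual_page to frame
        self.generation = {}   # frame to current generation count

    def map_page(self, vpage, frame):
        """Install or change a mapping and bump the generation of affected frames."""
        old = self.page_table.get(vpage)
        if old is not None:
            self.generation[old] = self.generation.get(old, 0) + 1
        self.page_table[vpage] = frame
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def access(self, frame, generation):
        """Accept the request only if the caller holds the current generation count."""
        return self.generation.get(frame) == generation


class Tlb:
    def __init__(self, memory):
        self.memory = memory
        self.entries = {}      # virtual_page to (frame, generation at fill time)
        self.refills = 0

    def load(self, vpage):
        frame = self.memory.page_table[vpage]
        self.entries[vpage] = (frame, self.memory.generation[frame])
        self.refills += 1

    def read(self, vpage):
        if vpage not in self.entries:
            self.load(vpage)
        frame, gen = self.entries[vpage]
        if not self.memory.access(frame, gen):   # stale entry: request denied
            self.load(vpage)                     # refresh the entry and retry
            frame, gen = self.entries[vpage]
        return frame
```

No interrupt is sent when the mapping changes; the stale entry is only discovered, and repaired, when it is next used.&lt;br /&gt;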
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy.&amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt; It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. The protocol improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. A hardware-based TLB coherence protocol can a) improve TLB coherence performance significantly for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty of a shootdown increases with the number of processors, as can be seen in Figure 3, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The effective latency is higher than shown in the graphs because TLB invalidations cost extra cycles to repopulate the TLB, from either the cache or main memory depending on the case. The overhead of TLB shootdowns is higher for applications that use the routine more often (see Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual to physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
To integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD adds a PTE CAM (PCAM) alongside each TLB: a content-addressable memory that holds the physical addresses of the PTEs whose translations are cached in that TLB. UNITD relaxes the correspondence of the translations to the memory block containing the PTE, rather than the PTE itself. Maintaining translation coherence at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same datum. For simplicity, this note refers to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
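&lt;br /&gt;
The two PCAM operations described above (insertion alongside a TLB fill, and invalidation on a coherence request) can be modeled as follows. This is a hedged software sketch of a hardware structure: the class UnitdTlb, its method names and the 64-byte block size are assumptions made for illustration and are not taken from the UNITD paper.&lt;br /&gt;

```python
# Hypothetical model of a TLB with a parallel PCAM; names and sizes are illustrative.

BLOCK_SIZE = 64  # assumed cache-block size in bytes

class UnitdTlb:
    def __init__(self, size):
        self.vpage = [None] * size    # TLB tag: virtual page
        self.frame = [None] * size    # TLB data: physical frame
        self.valid = [False] * size   # Valid bit: True is Shared, False is Invalid
        self.pcam = [None] * size     # block address of the PTE, same index as the TLB

    def insert(self, index, vpage, frame, pte_address):
        """Fill a TLB entry and record the block holding its PTE at the same index."""
        self.vpage[index] = vpage
        self.frame[index] = frame
        self.valid[index] = True
        self.pcam[index] = pte_address // BLOCK_SIZE

    def lookup(self, vpage):
        """Regular TLB lookup; the PCAM is not consulted on this path."""
        for i, tag in enumerate(self.vpage):
            if self.valid[i] and tag == vpage:
                return self.frame[i]
        return None                   # TLB miss

    def coherence_invalidate(self, physical_address):
        """Clear the Valid bit of every entry whose PTE lies in the written block."""
        block = physical_address // BLOCK_SIZE
        hits = [i for i, b in enumerate(self.pcam) if b == block]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits
```

Because matching is done on the block address, a write to one PTE also invalidates translations whose PTEs merely share its cache block, which is the small performance penalty mentioned earlier.&lt;br /&gt;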
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. (The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.)&lt;br /&gt;
&lt;br /&gt;
Three results stand out. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns (see Figure 5 below).&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== Definitions ==&lt;br /&gt;
&lt;br /&gt;
;PMT(Page Mapping Table)&lt;br /&gt;
: This is an in-memory data structure that primarily helps the operating system map two address spaces to one another: virtual memory and physical memory. In addition to this mapping information, the PMT also stores additional data fields to keep track of whether the virtual memory has been loaded in physical memory, permissions and utilization metrics. Coupled with the TLB, the PMT is a critical component of virtual memory systems. Learn more about page tables [http://en.wikipedia.org/wiki/Page_table here].&lt;br /&gt;
&lt;br /&gt;
;PTE (Page Table Entry)&lt;br /&gt;
: A row in a page mapping table that relates a single page in virtual memory to a physical memory address or a page frame.&lt;br /&gt;
&lt;br /&gt;
;TLB(Translation Look-Aside Buffer)&lt;br /&gt;
: Associative, high-speed memory used to cache page table information and speed up virtual memory references from the processor.&lt;br /&gt;
&lt;br /&gt;
;Presence Bit&lt;br /&gt;
: One of the data elements in each PMT entry. Indicates whether the page is loaded in main memory (versus stored to disk).&lt;br /&gt;
&lt;br /&gt;
;Safe Change&lt;br /&gt;
: A type of update that can be made to a PMT safely without also updating the cached TLB counterpart.&lt;br /&gt;
&lt;br /&gt;
;TLB Coherence Strategy&lt;br /&gt;
: A process or system for maintaining the consistency of information between the processor TLBs and the PMT entries.&lt;br /&gt;
&lt;br /&gt;
;IDT (Interrupt Descriptor Table)&lt;br /&gt;
: A vector-based data structure used to communicate information between processors during an inter-processor interrupt.&lt;br /&gt;
&lt;br /&gt;
;Interrupt&lt;br /&gt;
: A hardware or software based method for signalling to a processor that there is work for it.&lt;br /&gt;
&lt;br /&gt;
;ASID (Address Space Identifier)&lt;br /&gt;
: A number assigned to a process that is unique within the scope of a processor for a specified time period. A process may be assigned multiple ASIDs across time. Used with the TLB entries, the ASID helps discard stale TLB entries.&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
&lt;br /&gt;
[http://sequoia.ict.pwr.wroc.pl/~iro/RISC/sm/www.hp.com/acd-18.html#HEADING18-0 Address Resolution and the TLB]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication, 2004&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin, 2008-2009&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol10, Issue-2, 1990, Pages 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct (choose one answer)&lt;br /&gt;
::a) True&lt;br /&gt;
::b) False&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
::a) Virtual Page to Physical Page&lt;br /&gt;
::b) Virtual Address to Physical Address&lt;br /&gt;
::c) Physical Page to Virtual Page&lt;br /&gt;
::d) Physical Address to Virtual Address&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
::a) Software solutions&lt;br /&gt;
::b) Hardware solutions&lt;br /&gt;
4. UNITD is (choose one answer)&lt;br /&gt;
::a) A protocol for TLB coherence only&lt;br /&gt;
::b) A protocol for cache coherence only&lt;br /&gt;
::c) A protocol for cache and TLB coherence&lt;br /&gt;
5. Which of the following approaches provides a solution for TLB coherence (choose at least one)&lt;br /&gt;
::a) Virtual Cache address&lt;br /&gt;
::b) Shootdown&lt;br /&gt;
::c) Invalidation&lt;br /&gt;
::d) Hardware solutions&lt;br /&gt;
6. What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
::a) TLB active flag&lt;br /&gt;
::b) Hardware instructions&lt;br /&gt;
::c) Inter-Processor Interrupts&lt;br /&gt;
::d) PMT entries in main memory&lt;br /&gt;
7. Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
::a) Increase in protection level&lt;br /&gt;
::b) Changes to virtual-physical address mappings.&lt;br /&gt;
::c) Page table reference counts&lt;br /&gt;
::d) Presence bit&lt;br /&gt;
8. An ASID refers to: (choose one answer)&lt;br /&gt;
::a) The unique ID of a process for its lifetime&lt;br /&gt;
::b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
::c) The unique ID of a PMT entry&lt;br /&gt;
::d) The unique ID of a page&lt;br /&gt;
9. If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
::a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
::b) Between PMT entries in main memory and processor registers&lt;br /&gt;
::c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
::d) Between PMT entries in main memory and the data cache&lt;br /&gt;
10. How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
::a) Via a shared memory semaphore&lt;br /&gt;
::b) Via the interrupt IDC&lt;br /&gt;
::c) Via a hardware instruction&lt;br /&gt;
::d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60594</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60594"/>
		<updated>2012-03-26T22:02:46Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Quiz */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessing architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that maintains the virtual to physical mappings of the most recently used pages. The TLB is used by the CPU for fast look-up of the physical page frame number for a virtual memory location. The same look-up from the page table for the process (a software-based construct) is slower. A page may become invalid because it is swapped out to disk, or may change protection level (read-only versus read-write). Maintaining coherence between TLB entries on different processors for the same page (as its state changes) is called the TLB coherence problem. This coherence is most often maintained by software-based solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified, hardware-based cache-TLB coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, &amp;quot;'''Virtual Memory'''&amp;quot; (one single address space), that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to the page table ('''Page-Table Base Register, PTBR''') is stored in a register. Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page table entry that is 4 bytes long (32 bits) can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), then a system with 4-byte entries can address 2^32 × 2^12 = 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the PTBR), find the frame number in the page table entry, and only then access the required location in that frame. Memory access is thus slowed down by a factor of 2. &lt;br /&gt;
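&lt;br /&gt;
As a small worked illustration of the split into page number and offset, the following Python sketch translates one address through a toy page table. The 4 KB page size matches the example above; the function name translate and the contents of the table are assumptions for illustration only.&lt;br /&gt;

```python
PAGE_SIZE = 4096  # assumed 4 KB pages

def translate(virtual_address, page_table):
    """Split the address into (p, d) and combine the frame base with d."""
    page_number = virtual_address // PAGE_SIZE   # p
    offset = virtual_address % PAGE_SIZE         # d
    frame_number = page_table[page_number]       # the extra memory access
    return frame_number * PAGE_SIZE + offset

page_table = {0: 5, 1: 9}                        # toy table: page to frame
print(hex(translate(0x1234, page_table)))        # prints 0x9234
```

Virtual address 0x1234 lies in page 1 at offset 0x234; page 1 maps to frame 9, so the physical address is 0x9234.&lt;br /&gt;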
&lt;br /&gt;
[[File:TLB.png‎|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast look-up hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024. &lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found its frame number is immediately available and is used to access memory. &lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
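&lt;br /&gt;
The hit, miss and replacement behaviour described above can be sketched with a small LRU-managed TLB model. This is an illustrative software model only: real TLBs perform the comparison in parallel hardware, and the class name Tlb, the counters and the use of LRU here are assumptions.&lt;br /&gt;

```python
from collections import OrderedDict

class Tlb:
    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table
        self.entries = OrderedDict()   # page number to frame number, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)     # mark as most recently used
            return self.entries[page]
        self.misses += 1
        frame = self.page_table[page]          # reference the page table in memory
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        self.entries[page] = frame
        return frame
```

With a capacity of two entries, looking up pages 0, 1, 0 and 2 in that order evicts page 1, because page 0 was referenced more recently.&lt;br /&gt;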
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of the virtual page to the physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; design that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch it. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection level changes for a TLB mapping need to be observed by all processors. Mapping modifications for a TLB entry (where the physical mapping changes for a virtual address) also need to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once all recipients have responded to the interrupt, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state, update their TLB caches, and set their TLB active flags.&lt;br /&gt;
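&lt;br /&gt;
The six steps above can be sketched as a small, single-threaded Python model. This is only an illustration under stated assumptions: the names Processor, shootdown and translate are hypothetical, the page-table lock and the IPI are modelled as plain log entries and a loop, and recipients reload the changed entry lazily on their next miss rather than eagerly in step 6.&lt;br /&gt;
&lt;br /&gt;
```python
class Processor:
    """Toy model of one CPU with a private TLB (hypothetical, for illustration)."""

    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}                   # maps virtual page to physical frame
        self.tlb_active = True          # is this CPU currently using its TLB?
        self.interrupts_enabled = True


def shootdown(initiator, others, page_table, vpage, new_frame, log):
    """Walk through the six shootdown steps to change one mapping."""
    # Step 1: the initiator disables interrupts and clears its active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: lock the page table (modelled here as a log entry only).
    log.append("lock page table")

    # Steps 3-4: deliver an "IPI" to each recipient. Each one invalidates
    # the affected entry, clears its active flag and then waits.
    for cpu in others:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(vpage, None)
        cpu.tlb_active = False
        log.append("cpu%d acknowledged" % cpu.cpu_id)

    # Step 5: every recipient has responded, so update the PMT and the
    # initiator's own TLB, then reactivate and release the lock.
    page_table[vpage] = new_frame
    initiator.tlb[vpage] = new_frame
    initiator.tlb_active = True
    initiator.interrupts_enabled = True
    log.append("unlock page table")

    # Step 6: recipients leave the wait state and resume using their TLBs.
    for cpu in others:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True


def translate(cpu, page_table, vpage):
    """Look up a page, refilling the TLB from the page table on a miss."""
    if vpage not in cpu.tlb:
        cpu.tlb[vpage] = page_table[vpage]
    return cpu.tlb[vpage]
```
&lt;br /&gt;
After a call to shootdown, a translate on any processor in this model returns the new frame, because the stale entry was removed before the page table changed. A real implementation would need true concurrency, a real lock and hardware interrupts, none of which are modelled here.&lt;br /&gt;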
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system, which has been ported to the BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that holds the process-to-ASID mapping for each processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, which effectively invalidates them: subsequent lookups miss in the TLB and reload the translation from the page table.&lt;br /&gt;
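&lt;br /&gt;
A minimal Python sketch of this lazy scheme follows. It is an illustration only, not the IRIX or MIPS implementation: the class AsidCpu, the function pte_changed and the dictionary layout are hypothetical, and ASID exhaustion and wrap-around are ignored.&lt;br /&gt;
&lt;br /&gt;
```python
class AsidCpu:
    """Toy per-processor ASID state (all names are hypothetical)."""

    def __init__(self):
        self.next_asid = 1     # 0 is reserved to mean "no valid ASID"
        self.asid_of = {}      # maps pid to the ASID assigned on this CPU
        self.tlb = {}          # maps (asid, virtual page) to physical frame

    def asid_for(self, pid):
        """Return the process's ASID, assigning a fresh one if it is 0 or absent."""
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = self.next_asid
            self.next_asid += 1
        return self.asid_of[pid]

    def translate(self, pid, vpage, page_table):
        """Look up a translation; entries tagged with an old ASID never match."""
        key = (self.asid_for(pid), vpage)
        if key not in self.tlb:
            self.tlb[key] = page_table[(pid, vpage)]
        return self.tlb[key]


def pte_changed(pid, other_cpus):
    """Kernel action after a PTE update: zero the process's ASID elsewhere."""
    for cpu in other_cpus:
        if pid in cpu.asid_of:
            cpu.asid_of[pid] = 0   # the TLB entries themselves are untouched
```
&lt;br /&gt;
In this model the stale entry stays in the TLB dictionary after pte_changed, but it is tagged with the old ASID and can no longer be reached, so the next translate call refills from the page table.&lt;br /&gt;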
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are also used on AMD64 processors, for example by the Xen hypervisor with AMD-V. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
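&lt;br /&gt;
The generation-count check can be sketched in Python as below. This is a simplified model under assumptions: the classes Memory and ValidatingTlb are hypothetical, counts are kept per frame in a dictionary, and a rejected request is followed immediately by a refill from the page table.&lt;br /&gt;
&lt;br /&gt;
```python
class Memory:
    """Toy memory manager keeping one generation count per frame (hypothetical)."""

    def __init__(self):
        self.page_table = {}   # maps virtual page to (frame, writable)
        self.generation = {}   # maps frame to its generation count

    def map_page(self, vpage, frame, writable):
        """Install or change a mapping and bump the affected generation counts."""
        old = self.page_table.get(vpage)
        if old is not None:
            self.generation[old[0]] = self.generation.get(old[0], 0) + 1
        self.generation[frame] = self.generation.get(frame, 0) + 1
        self.page_table[vpage] = (frame, writable)

    def access(self, frame, generation):
        """Accept a request only if the caller's generation count is current."""
        return self.generation.get(frame) == generation


class ValidatingTlb:
    def __init__(self, memory):
        self.memory = memory
        self.entries = {}      # maps virtual page to (frame, writable, generation)

    def read(self, vpage):
        """Return the frame for vpage, refreshing the entry if memory rejects it."""
        entry = self.entries.get(vpage)
        if entry is None or not self.memory.access(entry[0], entry[2]):
            frame, writable = self.memory.page_table[vpage]
            entry = (frame, writable, self.memory.generation[frame])
            self.entries[vpage] = entry
        return entry[0]
```
&lt;br /&gt;
No interrupt or lock appears anywhere in the sketch: a stale entry is only discovered, and repaired, at the moment it is used.&lt;br /&gt;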
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin, and Bracy.&amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt; It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. It improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from Figure 3, which shows the latency for the processor issuing the shootdown and for the ones receiving it. The effective latency is higher than shown in the graphs because TLB invalidations cause extra cycles to be spent repopulating the TLB, either from the cache or from main memory depending on the case. The overhead from TLB shootdowns is higher for applications that use the routine more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol with Modified, Owned, Shared, and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual to physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are therefore only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence of the translations: each cached translation is tied to the memory block containing the PTE, rather than to the PTE itself. Maintaining translation granularity at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. The addresses of these blocks are held in the PCAM, a content-addressable memory that sits alongside the TLB. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same datum. For simplicity, we refer to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
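&lt;br /&gt;
The two PCAM operations described above can be sketched in Python as follows. This is a toy model, not the UNITD hardware: the class name UnitdTlb, the 64-byte block size and the list-based layout are assumptions made for illustration, and the PCAM search that hardware performs in parallel is written as a simple scan.&lt;br /&gt;
&lt;br /&gt;
```python
class UnitdTlb:
    """Toy TLB with a parallel PCAM; sizes and names are hypothetical."""

    def __init__(self, size, block_bytes=64):
        self.block_bytes = block_bytes
        self.tlb = [None] * size      # slot holds a (virtual page, frame) pair
        self.valid = [False] * size   # Valid bit: True is Shared, False is Invalid
        self.pcam = [None] * size     # slot holds the block address of the PTE

    def insert(self, index, vpage, frame, pte_address):
        """Fill a TLB slot and the PCAM slot at the same index."""
        self.tlb[index] = (vpage, frame)
        self.valid[index] = True
        # Coherence is tracked per cache block, so drop the in-block offset.
        self.pcam[index] = pte_address - pte_address % self.block_bytes

    def lookup(self, vpage):
        """Regular translation lookup; the PCAM is not consulted here."""
        for index, entry in enumerate(self.tlb):
            if self.valid[index] and entry[0] == vpage:
                return entry[1]
        return None                   # TLB miss

    def coherence_invalidate(self, address):
        """Clear the Valid bit of every entry whose PTE block matches."""
        block = address - address % self.block_bytes
        hits = [i for i, b in enumerate(self.pcam) if self.valid[i] and b == block]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits
```
&lt;br /&gt;
Two translations whose PTEs share a cache block are both invalidated by a single coherence request for that block, which mirrors the case shown in Figure 2(b).&lt;br /&gt;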
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. (The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.)&lt;br /&gt;
&lt;br /&gt;
The results show three things. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graph below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== Definitions ==&lt;br /&gt;
&lt;br /&gt;
;PMT (Page Mapping Table)&lt;br /&gt;
: This is an in-memory data structure that primarily helps the operating system map two address spaces to one another: virtual memory and physical memory. In addition to this mapping information, the PMT also stores additional data fields to keep track of whether the virtual memory has been loaded in physical memory, permissions and utilization metrics. Coupled with the TLB, the PMT is a critical component of virtual memory systems. Learn more about page tables [http://en.wikipedia.org/wiki/Page_table here].&lt;br /&gt;
&lt;br /&gt;
;PTE (Page Table Entry)&lt;br /&gt;
: A row in a page mapping table that relates a single page in virtual memory to a physical memory address or a page frame.&lt;br /&gt;
&lt;br /&gt;
;TLB (Translation Look-Aside Buffer)&lt;br /&gt;
: Associative, high-speed memory used to cache page table information and speed up virtual memory references from the processor.&lt;br /&gt;
&lt;br /&gt;
;Presence Bit&lt;br /&gt;
: One of the data elements in each PMT entry. Indicates whether the page is loaded in main memory (versus stored to disk).&lt;br /&gt;
&lt;br /&gt;
;Safe Change&lt;br /&gt;
: A type of update that can be made to a PMT safely without also updating the cached TLB counterpart.&lt;br /&gt;
&lt;br /&gt;
;TLB Coherence Strategy&lt;br /&gt;
: A process or system for maintaining the consistency of information between the processor TLBs and the PMT entries.&lt;br /&gt;
&lt;br /&gt;
;IDT (Interrupt Descriptor Table)&lt;br /&gt;
: A vector-based data structure that maps each interrupt vector to the address of its handler routine. During an inter-processor interrupt, the receiving processor uses it to locate the shootdown program.&lt;br /&gt;
&lt;br /&gt;
;Interrupt&lt;br /&gt;
: A hardware or software based method for signalling to a processor that there is work for it.&lt;br /&gt;
&lt;br /&gt;
;ASID (Address Space Identifier)&lt;br /&gt;
: A number assigned to a process that is unique within the scope of a processor for a specified time period. A process may be assigned multiple ASIDs across time. Used with the TLB entries, the ASID helps discard stale TLB entries.&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
&lt;br /&gt;
[http://sequoia.ict.pwr.wroc.pl/~iro/RISC/sm/www.hp.com/acd-18.html#HEADING18-0 Address Resolution and the TLB]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication, 2004&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin, 2008-2009&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol10, Issue-2, 1990, Pages 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
#Page Table is maintained as a software construct (choose one answer)&lt;br /&gt;
::a) True&lt;br /&gt;
::b) False&lt;br /&gt;
#TLB keeps a mapping of ___ to ____&lt;br /&gt;
::a) Virtual Page to Physical Page&lt;br /&gt;
::b) Virtual Address to Physical Address&lt;br /&gt;
::c) Physical Page to Virtual Page&lt;br /&gt;
::d) Physical Address to Virtual Address&lt;br /&gt;
#Currently TLB coherence is most often achieved through&lt;br /&gt;
::a) Software solutions&lt;br /&gt;
::b) Hardware solutions&lt;br /&gt;
#UNITD is (choose one answer)&lt;br /&gt;
::a) A protocol for TLB coherence only&lt;br /&gt;
::b) A protocol for cache coherence only&lt;br /&gt;
::c) A protocol for cache and TLB coherence&lt;br /&gt;
#Which of the following approaches provides a solution for TLB coherence (choose at least one)&lt;br /&gt;
::a) Virtual Cache address&lt;br /&gt;
::b) Shootdown&lt;br /&gt;
::c) Invalidation&lt;br /&gt;
::d) Hardware solutions&lt;br /&gt;
#What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
::a) TLB active flag&lt;br /&gt;
::b) Hardware instructions&lt;br /&gt;
::c) Inter-Processor Interrupts&lt;br /&gt;
::d) PMT entries in main memory&lt;br /&gt;
#Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
::a) Increase in protection level&lt;br /&gt;
::b) Changes to virtual-physical address mappings&lt;br /&gt;
::c) Page table reference counts&lt;br /&gt;
::d) Presence bit&lt;br /&gt;
#An ASID refers to: (choose one answer)&lt;br /&gt;
::a) The unique ID of a process for its lifetime&lt;br /&gt;
::b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
::c) The unique ID of a PMT entry&lt;br /&gt;
::d) The unique ID of a page&lt;br /&gt;
#If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
::a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
::b) Between PMT entries in main memory and processor registers&lt;br /&gt;
::c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
::d) Between PMT entries in main memory and the data cache&lt;br /&gt;
#How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
::a) Via a shared memory semaphore&lt;br /&gt;
::b) Via the interrupt IDT&lt;br /&gt;
::c) Via a hardware instruction&lt;br /&gt;
::d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60593</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60593"/>
		<updated>2012-03-26T22:02:33Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Quiz */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessing architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that maintains the virtual-to-physical mappings of the most recently used pages. The CPU uses the TLB for fast lookup of the physical page frame number for a virtual memory location. The same lookup from the page table for the process (a software-based construct) is slower. A page may become invalid due to a swap out to disk or may change protection level (read vs. read-write). Maintaining coherence between TLB entries on different processors for the same page (as its state changes) is called the TLB coherence problem. This coherence is most often maintained by software-based solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified cache-TLB coherence protocol (that is hardware based).&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (one single space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For a process, a pointer to the page table is stored in a register, the '''Page-Table Base Register''' (PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
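&lt;br /&gt;
The addressable-memory figure quoted above can be checked with a few lines of Python. The constant names are illustrative, and TB is taken here to mean 2^40 bytes.&lt;br /&gt;
&lt;br /&gt;
```python
ENTRY_BITS = 32                  # a 4-byte page table entry
FRAME_BYTES = 4 * 1024           # 4 KB frames, i.e. 2**12 bytes

frames = 2 ** ENTRY_BITS         # distinct frame numbers an entry can hold
addressable = frames * FRAME_BYTES

print(addressable == 2 ** 44)    # True
print(addressable // 2 ** 40)    # 16
```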
&lt;br /&gt;
Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the PTBR), obtain the frame number from the page table entry, and only then access the desired location in that frame. Two memory accesses are needed for each reference, so memory access is slowed down by a factor of 2.&lt;br /&gt;
&lt;br /&gt;
[[File:TLB.png|400px|thumb|right|Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found its frame number is immediately available and is used to access memory. &lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
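&lt;br /&gt;
The lookup, miss and replacement behaviour described above can be sketched in Python. This is a minimal model under assumptions: the class SimpleTlb and the 4096-byte page size are illustrative, LRU is chosen as the replacement policy, and the associative search is modelled with a dictionary.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict

PAGE_SIZE = 4096   # assumed page size in bytes


class SimpleTlb:
    """Toy fully associative TLB with LRU replacement (hypothetical names)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table     # maps page number to frame number
        self.entries = OrderedDict()     # cached translations, LRU entry first
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        """Split the address into (p, d) and return the physical address."""
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)         # mark as most recently used
        else:
            self.misses += 1                       # costs a page table reference
            if len(self.entries) == self.capacity:
                self.entries.popitem(last=False)   # evict the LRU entry
            self.entries[page] = self.page_table[page]
        return self.entries[page] * PAGE_SIZE + offset
```
&lt;br /&gt;
Counting hits and misses in such a model gives a feel for why a small TLB is effective when references cluster on a few pages.&lt;br /&gt;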
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first level page table points to the entry in the next level page table and so on until the mapping of virtual to physical page is obtained from the last level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; scheme that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB is used to fetch it. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If the TLB has a mapping from a page block that falls within an invalidated Page Table Entry (PTE), then it needs to be flushed out from all TLBs and should not be used. Further, protection-level changes for a TLB mapping need to be observed by all processors. Mapping modifications for a TLB entry (where the physical mapping changes for a virtual address) also need to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
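&lt;br /&gt;
The safe/unsafe distinction above can be sketched as a small classifier. This is only an illustration under stated assumptions: the names PmtEntry and is_safe_change are hypothetical, and the entry is reduced to the frame number, protection field and presence bit described earlier.&lt;br /&gt;
&lt;br /&gt;
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PmtEntry:
    """Hypothetical, simplified PMT entry (not a real hardware layout)."""
    frame: int          # physical frame number
    protection: str     # "read-only" or "read-write"
    present: bool       # presence bit


def is_safe_change(old, new):
    """Return True if TLBs may keep a stale copy of `old` after the update."""
    # Remapping a resident page is unsafe: a stale TLB entry would
    # reference a frame that may have been reassigned.
    if old.present and new.frame != old.frame:
        return False
    # Clearing the presence bit is unsafe for the same reason.
    if old.present and not new.present:
        return False
    # Taking access rights away (read-write to read-only) is unsafe:
    # no fault would ever trap the stale, more permissive entry.
    if old.protection == "read-write" and new.protection == "read-only":
        return False
    # Granting rights or setting the presence bit is safe: a stale entry
    # causes at worst a fault that the kernel resolves lazily.
    return True
```
&lt;br /&gt;
For example, changing a resident page from read-only to read-write is classified as safe, while the reverse change is not.&lt;br /&gt;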
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants differ in two main respects: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins with the initiating processor locking the page table (or some portion thereof).&lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
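&lt;br /&gt;
The six steps above can be walked through in a single-threaded simulation. This is a sketch, not a kernel implementation: the Cpu class, the shootdown function and the log list are hypothetical, the inter-processor interrupt is modelled as a plain loop over the victims, and re-enabling interrupts at the end is an assumption the steps do not state.&lt;br /&gt;
&lt;br /&gt;
```python
class Cpu:
    """Hypothetical processor model with a private TLB."""

    def __init__(self, name):
        self.name = name
        self.tlb = {}                   # virtual page to frame
        self.tlb_active = True          # "active flag" from step 1
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, vpage, new_frame, log):
    # Step 1: initiator disables interrupts and clears its active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: lock the page table (modelled as a log entry only).
    log.append("lock page table")
    # Steps 3 and 4: "send" the IPI; each victim invalidates the entry,
    # clears its active flag and waits.
    for cpu in victims:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(vpage, None)
        cpu.tlb_active = False
        log.append(cpu.name + " invalidated and waiting")
    # Step 5: update the PMT and the local TLB, set the flag, unlock.
    page_table[vpage] = new_frame
    initiator.tlb[vpage] = new_frame
    initiator.tlb_active = True
    initiator.interrupts_enabled = True   # assumption, see lead-in
    log.append("unlock page table")
    # Step 6: victims leave the wait state; in this sketch they reload
    # the translation lazily on their next TLB miss.
    for cpu in victims:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True     # assumption, see lead-in
```
&lt;br /&gt;
A real implementation runs the victims concurrently and must make the wait state interruptible only by the initiator's release of the lock; the sequential loop here only shows the ordering of the steps.&lt;br /&gt;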
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Raj and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an address space identifier (ASID). The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’s ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to its per-processor ASIDs.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, p. 440,441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so the stale entries are never used again and are effectively invalidated.&lt;br /&gt;
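&lt;br /&gt;
The bookkeeping described above can be sketched as follows. The class and method names (AsidAllocator, invalidate_others, asid_for) are hypothetical, and the sketch assumes an unbounded supply of ASIDs; real hardware has a small ASID field and must recycle values.&lt;br /&gt;
&lt;br /&gt;
```python
class AsidAllocator:
    """Hypothetical per-processor ASID table; 0 means "no valid ASID"."""

    def __init__(self, cpu_count):
        self.next_asid = [1] * cpu_count   # next free ASID on each processor
        self.asid_of = {}                  # (pid, cpu) to ASID: the "separate array"

    def invalidate_others(self, pid, initiating_cpu):
        # Called after `pid` modifies a PTE: zero its ASID on every other
        # processor. TLB entries themselves are left untouched.
        for (p, cpu) in list(self.asid_of):
            if p == pid and cpu != initiating_cpu:
                self.asid_of[(p, cpu)] = 0

    def asid_for(self, pid, cpu):
        # Called when `pid` is scheduled on `cpu`: a zero (or missing)
        # ASID is replaced by a fresh one, so stale TLB entries tagged
        # with the old ASID can no longer match.
        asid = self.asid_of.get((pid, cpu), 0)
        if asid == 0:
            asid = self.next_asid[cpu]
            self.next_asid[cpu] += 1
            self.asid_of[(pid, cpu)] = asid
        return asid
```
&lt;br /&gt;
The design choice is that no interrupt is sent: the cost of the update is deferred to the next time the process runs on each of the other processors.&lt;br /&gt;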
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are also used on AMD64 processors, for example by the Xen hypervisor with AMD-V (SVM). &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
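&lt;br /&gt;
A minimal sketch of the validation idea, assuming that the generation count is kept per frame and bumped on every remapping. The classes Memory and GenerationTlb and all their members are hypothetical names used only for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
class Memory:
    """Hypothetical memory manager holding the page table and generation counts."""

    def __init__(self):
        self.page_table = {}   # virtual page to frame
        self.generation = {}   # frame to generation count

    def _bump(self, frame):
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def map(self, vpage, frame):
        # Changing a mapping modifies the count of the old and the new frame.
        old_frame = self.page_table.get(vpage)
        if old_frame is not None:
            self._bump(old_frame)
        self.page_table[vpage] = frame
        self._bump(frame)

    def check(self, frame, generation):
        # Memory-side validation performed on every access.
        return self.generation.get(frame) == generation


class GenerationTlb:
    def __init__(self, memory):
        self.memory = memory
        self.entries = {}      # virtual page to (frame, generation)
        self.denied = 0        # number of accesses rejected as stale

    def translate(self, vpage):
        if vpage in self.entries:
            frame, generation = self.entries[vpage]
            if self.memory.check(frame, generation):
                return frame
            self.denied += 1           # request denied: entry is stale
            del self.entries[vpage]
        frame = self.memory.page_table[vpage]
        self.entries[vpage] = (frame, self.memory.generation[frame])
        return frame
```
&lt;br /&gt;
No interrupt or lock is involved: the stale entry is discovered only when the processor next uses it.&lt;br /&gt;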
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy.&amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt; It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from Figure 3, which shows the latency for the processor issuing the shootdown and for the ones receiving it. The effective latency is higher than shown in the graphs because TLB invalidations cost extra cycles to repopulate the TLB, either from the cache or from main memory depending on the case. The overhead of TLB shootdowns is higher for applications that trigger them more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only (virtual-to-physical address mappings are never modified in the TLBs themselves, but rather in the page table entries). There are therefore only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses that depend on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
To integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD adds a PTE CAM (PCAM) to each TLB: a content-addressable memory that holds the physical addresses of the PTEs whose translations are cached in that TLB. UNITD also relaxes the correspondence between a cached translation and its PTE, tying the translation to the memory block that contains the PTE rather than to the PTE itself. Maintaining translation coherence at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same datum. For simplicity, PCAM entries are referred to here simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
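&lt;br /&gt;
The two PCAM operations described above (insertion alongside the TLB entry, and invalidation on a coherence request) can be modelled in a few lines. This is a software sketch of a hardware structure: the class UnitdTlb and its methods are hypothetical names, and the associative lookups are modelled as linear scans.&lt;br /&gt;
&lt;br /&gt;
```python
class UnitdTlb:
    """Hypothetical model of a TLB with a PCAM of the same size."""

    def __init__(self, size):
        self.tlb = [None] * size      # (virtual page, frame) per slot
        self.valid = [False] * size   # Valid bit: True is Shared, False is Invalid
        self.pcam = [None] * size     # PTE block address, same index as the TLB slot

    def insert(self, index, vpage, frame, pte_block_addr):
        # The PTE address is added to the PCAM together with the translation.
        self.tlb[index] = (vpage, frame)
        self.valid[index] = True
        self.pcam[index] = pte_block_addr

    def coherence_invalidate(self, block_addr):
        # A coherence request for read-write access to a block clears the
        # Valid bit of every translation whose PTE lives in that block.
        hits = [i for i, addr in enumerate(self.pcam) if addr == block_addr]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits

    def lookup(self, vpage):
        # Regular TLB lookup; the PCAM is not consulted on this path.
        for i, entry in enumerate(self.tlb):
            if self.valid[i] and entry[0] == vpage:
                return entry[1]
        return None                   # TLB miss
```
&lt;br /&gt;
With two translations whose PTEs share block address 12, a single invalidation of address 12 hits both PCAM entries, mirroring the situation in Figure 2(b).&lt;br /&gt;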
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. (The ideal-invalidation system uses the same modified OS as UNITD, i.e., with no TLB shootdown support, and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.)&lt;br /&gt;
&lt;br /&gt;
Three results stand out. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns (refer to the graph below).&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== Definitions ==&lt;br /&gt;
&lt;br /&gt;
;PMT(Page Mapping Table)&lt;br /&gt;
: This is an in-memory data structure that primarily helps the operating system map two address spaces to one another: virtual memory and physical memory. In addition to this mapping information, the PMT also stores additional data fields to keep track of whether the virtual memory has been loaded in physical memory, permissions and utilization metrics. Coupled with the TLB, the PMT is a critical component of virtual memory systems. Learn more about page tables [http://en.wikipedia.org/wiki/Page_table here].&lt;br /&gt;
&lt;br /&gt;
;PTE (Page Table Entry)&lt;br /&gt;
: A row in a page mapping table that relates a single page in virtual memory to a physical memory address or a page frame.&lt;br /&gt;
&lt;br /&gt;
;TLB(Translation Look-Aside Buffer)&lt;br /&gt;
: Associative, high-speed memory used to cache page table information and speed up virtual memory references from the processor.&lt;br /&gt;
&lt;br /&gt;
;Presence Bit&lt;br /&gt;
: One of the data elements in each PMT entry. Indicates whether the page is loaded in main memory (versus stored to disk).&lt;br /&gt;
&lt;br /&gt;
;Safe Change&lt;br /&gt;
: A type of update that can be made to a PMT safely without also updating the cached TLB counterpart.&lt;br /&gt;
&lt;br /&gt;
;TLB Coherence Strategy&lt;br /&gt;
: A process or system for maintaining the consistency of information between the processor TLBs and the PMT entries.&lt;br /&gt;
&lt;br /&gt;
;IDT (Interrupt Descriptor Table)&lt;br /&gt;
: A vector-based data structure used to communicate information between processors during an inter-processor interrupt.&lt;br /&gt;
&lt;br /&gt;
;Interrupt&lt;br /&gt;
: A hardware or software based method for signalling to a processor that there is work for it.&lt;br /&gt;
&lt;br /&gt;
;ASID (Address Space Identifier)&lt;br /&gt;
: A number assigned to a process that is unique within the scope of a processor for a specified time period. A process may be assigned multiple ASIDs across time. Used with the TLB entries, the ASID helps discard stale TLB entries.&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
&lt;br /&gt;
[http://sequoia.ict.pwr.wroc.pl/~iro/RISC/sm/www.hp.com/acd-18.html#HEADING18-0 Address Resolution and the TLB]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication, 2004&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin, 2008-2009&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol10, Issue-2, 1990, Pages 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
#Page Table is maintained as a software construct (choose one answer)&lt;br /&gt;
::a) True&lt;br /&gt;
::b) False&lt;br /&gt;
#TLB keeps a mapping of ___ to ____&lt;br /&gt;
::a) Virtual Page to Physical Page&lt;br /&gt;
::b) Virtual Address to Physical Address&lt;br /&gt;
::c) Physical Page to Virtual Page&lt;br /&gt;
::d) Physical Address to Virtual Address&lt;br /&gt;
#Currently TLB coherence is most often achieved through&lt;br /&gt;
::a) Software solutions&lt;br /&gt;
::b) Hardware solutions&lt;br /&gt;
#UNITD is (choose one answer)&lt;br /&gt;
::a) A protocol for TLB coherence only&lt;br /&gt;
::b) A protocol for cache coherence only&lt;br /&gt;
::c) A protocol for cache and TLB coherence&lt;br /&gt;
#Which of the following approaches provides a solution for TLB coherence (choose at least one)&lt;br /&gt;
::a) Virtual Cache address&lt;br /&gt;
::b) Shootdown&lt;br /&gt;
::c) Invalidation&lt;br /&gt;
::d) Hardware solutions&lt;br /&gt;
#What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
::a) TLB active flag&lt;br /&gt;
::b) Hardware instructions&lt;br /&gt;
::c) Inter-Processor Interrupts&lt;br /&gt;
::d) PMT entries in main memory&lt;br /&gt;
#Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
::a) Increase in protection level; &lt;br /&gt;
::b) Changes to virtual-physical address mappings.&lt;br /&gt;
::c) Page table reference counts&lt;br /&gt;
::d) Presence bit&lt;br /&gt;
#An ASID refers to: (choose one answer)&lt;br /&gt;
::a) The unique ID of a process for its lifetime&lt;br /&gt;
::b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
::c) The unique ID of a PMT entry&lt;br /&gt;
::d) The unique ID of a page&lt;br /&gt;
#If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
::a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
::b) Between PMT entries in main memory and processor registers&lt;br /&gt;
::c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
::d) Between PMT entries in main memory and the data cache&lt;br /&gt;
#How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
::a) Via a shared memory semaphore&lt;br /&gt;
::b) Via the interrupt IDC&lt;br /&gt;
::c) Via a hardware instruction&lt;br /&gt;
::d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60592</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60592"/>
		<updated>2012-03-26T22:01:57Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Quiz */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessing architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that maintains the virtual-to-physical mappings of the most recently used pages. The TLB is used by the CPU for fast lookup of the physical page frame number for a virtual memory location. The same lookup from the process's page table (a software-based construct) is slower. A page may become invalid due to a swap out to disk or may change protection level (read-only versus read-write). Maintaining coherence between TLB entries on different processors for the same page (as its state changes) is called the TLB coherence problem. This coherence is most often maintained by software-based solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified cache-TLB coherence protocol (that is hardware based).&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, its &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space), which is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. A pointer to the process's page table is stored in a register, the '''Page-Table Base Register''' (PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), then a system with 4-byte entries can address 2^32 × 2^12 = 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. If we want to access a location, we must first index into the page table (which requires a memory access using the PTBR), read the frame number from the page table entry, and only then access the required location. Each access to user memory therefore needs two memory accesses, so memory access is slowed down by a factor of 2.&lt;br /&gt;
&lt;br /&gt;
[[File:TLB.png|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found its frame number is immediately available and is used to access memory. &lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
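&lt;br /&gt;
The hit and miss paths described above can be modelled in software. The sketch below is only an illustration: it assumes LRU replacement (one of the policies mentioned), represents the page table as a dictionary, and uses a hypothetical class name TLB.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict

class TLB:
    """Tiny software model of a fully associative TLB with LRU replacement."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table      # dict: page number to frame number
        self.entries = OrderedDict()      # page number to frame number, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)    # mark as most recently used
            return self.entries[page]
        # TLB miss: reference the page table, then cache the translation.
        self.misses += 1
        frame = self.page_table[page]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        self.entries[page] = frame
        return frame

tlb = TLB(capacity=2, page_table={0: 5, 1: 9, 2: 2})
frames = [tlb.lookup(p) for p in (0, 1, 0, 2, 1)]
print(frames, tlb.hits, tlb.misses)
```
&lt;br /&gt;
With a two-entry TLB, the reference string 0, 1, 0, 2, 1 produces one hit and four misses: page 1 is evicted when page 2 is loaded and must be fetched again.&lt;br /&gt;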
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the page is read-only or read-write (the '''protection level'''). Another bit, '''valid-invalid''', indicates whether the page is part of that particular process's address space.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. An entry in the first-level page table points to a next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
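&lt;br /&gt;
A two-level walk can be sketched as below. The bit widths (10 + 10 + 12 for a 32-bit address), the table contents and the function names are assumptions chosen for illustration, not values taken from a particular architecture.&lt;br /&gt;
&lt;br /&gt;
```python
OFFSET_BITS = 12   # assumed 4 KB pages
INDEX_BITS = 10    # assumed 10-bit index at each of the two levels

def split(virtual_address):
    """Split a 32-bit virtual address into (level-1 index, level-2 index, offset)."""
    offset = virtual_address % 2 ** OFFSET_BITS
    page_number = virtual_address // 2 ** OFFSET_BITS
    return page_number // 2 ** INDEX_BITS, page_number % 2 ** INDEX_BITS, offset

# Hypothetical tables: the first level maps an index to a second-level table,
# and a second-level table maps an index to a frame number.
second_level_table = {2: 0x55}
first_level = {1: second_level_table}

def walk(virtual_address):
    i1, i2, offset = split(virtual_address)
    second_level = first_level[i1]        # one memory access per level
    frame = second_level[i2]
    return frame * 2 ** OFFSET_BITS + offset

va = (1 * 2 ** 22) + (2 * 2 ** 12) + 0xABC
print(split(va), hex(walk(va)))
```
&lt;br /&gt;
Each additional level adds one memory access to a walk, which makes a TLB hit even more valuable than with a single-level table.&lt;br /&gt;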
&lt;br /&gt;
It may be noted that L1 caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch the block. Performing both look-ups simultaneously makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because data is shared across processors and because processes migrate from one processor to another.) Since the contents of these mappings can change in response to changes of state (of the mapping and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events:&lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to become invalid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, a protection-level change to a mapping must be observed by all processors. Likewise, a mapping modification (where the physical page mapped to a virtual address changes) must be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for every datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
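&lt;br /&gt;
The lazy update in the read-only to read-write example can be sketched as follows. This is a simplified model under stated assumptions: the page table and the TLB are plain dictionaries, and the function store stands in for both the store instruction and the fault handler; none of these names come from a real kernel.&lt;br /&gt;
&lt;br /&gt;
```python
READ_ONLY, READ_WRITE = "r", "rw"

page_table = {7: READ_ONLY}     # shared page table: page number to protection
tlb_core0 = {7: READ_ONLY}      # core 0 has cached the translation for page 7

def store(tlb, page):
    """Attempt a store from a core; returns how the access completed."""
    if tlb[page] == READ_WRITE:
        return "ok"
    # Access violation: the fault handler rechecks the page table.
    if page_table[page] == READ_WRITE:
        tlb[page] = READ_WRITE     # lazily reload the updated translation
        return "ok after lazy reload"
    return "protection fault"

first = store(tlb_core0, 7)        # the page really is read-only
page_table[7] = READ_WRITE         # another core increases the privileges (safe change)
second = store(tlb_core0, 7)       # the stale TLB entry is repaired on demand
third = store(tlb_core0, 7)
print(first, second, third, sep=" | ")
```
&lt;br /&gt;
The opposite change is not caught this way: a stale read-write entry would let a store succeed without ever raising a fault, which is why decreasing privileges is classified as unsafe.&lt;br /&gt;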
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
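&lt;br /&gt;
The six steps above can be sketched as a single-threaded simulation. This is a hedged illustration, not an operating system implementation: the inter-processor interrupt is modelled as a direct method call, the page-table lock as a plain field, victims refill their TLBs lazily on the next miss, and all class and function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
class PageTable:
    def __init__(self, mappings):
        self.mappings = dict(mappings)   # page number to frame number
        self.locked_by = None            # id of the processor holding the lock


class Processor:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}                    # page number to frame number
        self.tlb_active = True
        self.interrupts_enabled = True

    def read(self, page, page_table):
        if page not in self.tlb:         # TLB miss: reload from the page table
            self.tlb[page] = page_table.mappings[page]
        return self.tlb[page]

    def handle_shootdown_ipi(self, pages):
        """Step 4: invalidate the affected entries and stop using the TLB."""
        self.interrupts_enabled = False
        for page in pages:
            self.tlb.pop(page, None)
        self.tlb_active = False

    def resume(self):
        """Step 6: leave the wait state and start using the TLB again."""
        self.tlb_active = True
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, updates):
    """Apply updates (page to new frame) following steps 1 to 6 above."""
    # Step 1: the initiator disables interrupts and clears its active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: lock the page table against concurrent modification.
    page_table.locked_by = initiator.cpu_id
    # Step 3: send the IPI and wait until every victim has responded.
    for victim in victims:
        victim.handle_shootdown_ipi(list(updates))
    assert not any(v.tlb_active for v in victims)
    # Step 5: update the page table and the initiator's own TLB, then unlock.
    page_table.mappings.update(updates)
    for page, frame in updates.items():
        if page in initiator.tlb:
            initiator.tlb[page] = frame
    initiator.tlb_active = True
    initiator.interrupts_enabled = True   # assumption: re-enabled at this point
    page_table.locked_by = None
    # Step 6: the victims leave their wait state.
    for victim in victims:
        victim.resume()


table = PageTable({3: 10})
cpus = [Processor(i) for i in range(3)]
for cpu in cpus:
    cpu.read(3, table)                    # every TLB caches page 3 as frame 10
shootdown(cpus[0], cpus[1:], table, {3: 42})
print([cpu.read(3, table) for cpu in cpus])
```
&lt;br /&gt;
After the shootdown every processor sees frame 42 for page 3. In a real system steps 3 and 4 run concurrently on different processors, which is where the synchronization cost arises.&lt;br /&gt;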
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
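&lt;br /&gt;
The broadcast-and-snoop behaviour can be sketched as below. The classes Bus and Core and the method tlb_invalidate are hypothetical stand-ins for the shared bus, the per-processor snooper and the invalidate instruction; no real instruction set is being modelled.&lt;br /&gt;
&lt;br /&gt;
```python
class Bus:
    """Shared bus: every attached snooper sees every broadcast."""

    def __init__(self):
        self.snoopers = []

    def broadcast_invalidate(self, page):
        for snooper in self.snoopers:
            snooper.snoop_invalidate(page)


class Core:
    def __init__(self, bus):
        self.tlb = {}                 # page number to frame number
        self.bus = bus
        bus.snoopers.append(self)

    def snoop_invalidate(self, page):
        self.tlb.pop(page, None)      # drop the entry if this TLB holds it

    def tlb_invalidate(self, page):
        """Stand-in for a hypothetical TLB-invalidate instruction."""
        self.bus.broadcast_invalidate(page)


bus = Bus()
cores = [Core(bus) for _ in range(3)]
for core in cores:
    core.tlb.update({1: 11, 2: 22})
cores[0].tlb_invalidate(1)
print([sorted(core.tlb) for core in cores])
```
&lt;br /&gt;
Unlike a shootdown, no interrupt handler runs on the remote cores; each TLB simply loses the entry for page 1 and keeps the entry for page 2.&lt;br /&gt;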
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to its ASID on each processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, which effectively invalidates them.&lt;br /&gt;
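&lt;br /&gt;
The ASID mechanism can be sketched as follows. This is a simplified model under stated assumptions: two processors, dictionaries for the page table and the TLBs, ASID 0 meaning that a new ASID is needed, and no handling of ASID wrap-around. The class Kernel and its methods are hypothetical names.&lt;br /&gt;
&lt;br /&gt;
```python
NUM_CPUS = 2

class Kernel:
    def __init__(self):
        self.page_table = {}                         # (pid, page) to frame
        self.asid = {}                               # pid to list of per-CPU ASIDs
        self.next_asid = [1] * NUM_CPUS              # next unused ASID on each CPU
        self.tlb = [dict() for _ in range(NUM_CPUS)] # per CPU: (asid, page) to frame

    def current_asid(self, pid, cpu):
        asids = self.asid.setdefault(pid, [0] * NUM_CPUS)
        if asids[cpu] == 0:                          # zero ASID: hand out a fresh one
            asids[cpu] = self.next_asid[cpu]
            self.next_asid[cpu] += 1
        return asids[cpu]

    def access(self, pid, cpu, page):
        key = (self.current_asid(pid, cpu), page)
        if key not in self.tlb[cpu]:                 # stale entries carry an old ASID
            self.tlb[cpu][key] = self.page_table[(pid, page)]
        return self.tlb[cpu][key]

    def modify_pte(self, pid, cpu, page, frame):
        self.page_table[(pid, page)] = frame
        local_key = (self.current_asid(pid, cpu), page)
        self.tlb[cpu].pop(local_key, None)           # invalidate the local entry
        for other in range(NUM_CPUS):
            if other != cpu:
                self.asid[pid][other] = 0            # only the mapping array changes


k = Kernel()
k.page_table[(1, 0)] = 10
before = (k.access(1, 0, 0), k.access(1, 1, 0))
k.modify_pte(pid=1, cpu=0, page=0, frame=20)
after = (k.access(1, 0, 0), k.access(1, 1, 0))
print(before, after, k.asid[1])
```
&lt;br /&gt;
The stale entry tagged with the old ASID is still present in the second processor's TLB after the update, but it can no longer match because the process now runs there under a new ASID.&lt;br /&gt;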
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are also used on AMD64 processors, for example by the Xen hypervisor with AMD-V (SVM). &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
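&lt;br /&gt;
The generation-count check can be sketched as below. The model is an assumption-laden illustration: generation counts are kept per frame in a dictionary, every frame involved in a remapping has its count incremented, and the classes Memory and Processor are hypothetical names.&lt;br /&gt;
&lt;br /&gt;
```python
class Memory:
    """Memory-side state: per-frame generation counts checked on each access."""

    def __init__(self):
        self.page_table = {}     # page number to frame number
        self.generation = {}     # frame number to generation count

    def map_page(self, page, frame):
        """Remap a page; every frame whose mapping changes gets a new generation."""
        old = self.page_table.get(page)
        for changed in (old, frame):
            if changed is not None:
                self.generation[changed] = self.generation.get(changed, 0) + 1
        self.page_table[page] = frame

    def access(self, frame, generation):
        return self.generation[frame] == generation   # False: the request is denied


class Processor:
    def __init__(self, memory):
        self.memory = memory
        self.tlb = {}            # page to (frame, generation seen at fill time)
        self.refills = 0

    def fill(self, page):
        frame = self.memory.page_table[page]
        self.tlb[page] = (frame, self.memory.generation[frame])
        self.refills += 1

    def read(self, page):
        if page not in self.tlb:
            self.fill(page)
        frame, generation = self.tlb[page]
        if not self.memory.access(frame, generation):  # stale entry detected lazily
            self.fill(page)
            frame, generation = self.tlb[page]
        return frame


mem = Memory()
mem.map_page(page=0, frame=4)
cpu = Processor(mem)
first = cpu.read(0)
mem.map_page(page=0, frame=8)    # page 0 moves from frame 4 to frame 8
second = cpu.read(0)
print(first, second, cpu.refills)
```
&lt;br /&gt;
No interrupt is sent when the mapping changes; the processor discovers the stale entry only when its next access to page 0 is denied, and then refills its TLB.&lt;br /&gt;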
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy.&amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt; It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. The TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. This protocol improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty of a shootdown increases with the number of processors, as can be seen from Figure 3, which shows the latency for the processor issuing the shootdown and for the ones receiving it. The effective latency is higher than shown in the graphs because the TLB invalidations result in extra cycles spent repopulating the TLB, from either the cache or main memory depending on the case. The overhead of TLB shootdowns is higher for applications that use the routine more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual to physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) A TLB entry therefore has only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses that depend on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
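Coherence requests carry physical addresses, whereas the TLB is looked up by virtual page number. UNITD therefore adds to each TLB a reverse-lookup structure, the PCAM (PTE content-addressable memory), which holds the physical addresses of the PTEs whose translations are cached in the TLB. In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence so that a translation is tied to the memory block containing its PTE, rather than to the PTE itself. Maintaining translation coherence at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same datum. For simplicity, we refer to PCAM entries simply as PTE addresses.&lt;br /&gt;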
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
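&lt;br /&gt;
The interaction between the TLB Valid bits and the PCAM can be sketched as follows. This is an illustrative software model, not the hardware design from the paper: the 64-byte block size, the PTE addresses and the class name UnitdTLB are assumptions chosen so that two PTEs fall into block 12, as in Figure 2.&lt;br /&gt;
&lt;br /&gt;
```python
BLOCK_SIZE = 64   # assumed cache-block size in bytes

class UnitdTLB:
    def __init__(self, size):
        self.valid = [False] * size          # Valid bit: True is Shared, False is Invalid
        self.translation = [None] * size     # (virtual page, frame) per entry
        self.pcam = [None] * size            # PTE block address, same index as the TLB

    def insert(self, index, vpage, frame, pte_address):
        """Fill TLB entry index; the PCAM entry is written at the same index."""
        self.translation[index] = (vpage, frame)
        self.valid[index] = True
        self.pcam[index] = pte_address // BLOCK_SIZE

    def evict(self, index):
        """Replacement of a translation also removes its PCAM entry."""
        self.valid[index] = False
        self.translation[index] = None
        self.pcam[index] = None

    def coherence_invalidate(self, physical_address):
        """Clear the Valid bit of every entry whose PTE lies in the requested block."""
        block = physical_address // BLOCK_SIZE
        hits = [i for i, tag in enumerate(self.pcam) if tag == block]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits

    def lookup(self, vpage):
        """Regular TLB lookup; the PCAM is not consulted on this path."""
        for i, entry in enumerate(self.translation):
            if self.valid[i] and entry[0] == vpage:
                return entry[1]
        return None                           # miss: translation must be reacquired


tlb = UnitdTLB(size=4)
tlb.insert(0, vpage=7, frame=70, pte_address=0x300)   # PTEs at 0x300 and 0x308
tlb.insert(1, vpage=8, frame=80, pte_address=0x308)   # share block 12
tlb.insert(2, vpage=9, frame=90, pte_address=0x340)   # block 13
hits = tlb.coherence_invalidate(0x300)
print(hits, tlb.lookup(7), tlb.lookup(8), tlb.lookup(9))
```
&lt;br /&gt;
A single invalidation for block 12 hits two PCAM entries and invalidates both translations, even if only one of the two PTEs was actually written. This is the small performance penalty of tracking coherence at cache-block granularity.&lt;br /&gt;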
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
Three results stand out. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graph below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== Definitions ==&lt;br /&gt;
&lt;br /&gt;
;PMT (Page Mapping Table)&lt;br /&gt;
: This is an in-memory data structure that primarily helps the operating system map two address spaces to one another: virtual memory and physical memory. In addition to this mapping information, the PMT stores data fields that track whether the virtual page has been loaded into physical memory, its permissions and utilization metrics. Coupled with the TLB, the PMT is a critical component of virtual memory systems. Learn more about page tables [http://en.wikipedia.org/wiki/Page_table here].&lt;br /&gt;
&lt;br /&gt;
;PTE (Page Table Entry)&lt;br /&gt;
: A row in a page mapping table that relates a single page in virtual memory to a physical memory address or a page frame.&lt;br /&gt;
&lt;br /&gt;
;TLB (Translation Look-Aside Buffer)&lt;br /&gt;
: Associative, high-speed memory used to cache page table information and speed up virtual memory references from the processor.&lt;br /&gt;
&lt;br /&gt;
;Presence Bit&lt;br /&gt;
: One of the data elements in each PMT entry. Indicates whether the page is loaded in main memory (versus stored to disk).&lt;br /&gt;
&lt;br /&gt;
;Safe Change&lt;br /&gt;
: A type of update that can be made to a PMT safely without also updating the cached TLB counterpart.&lt;br /&gt;
&lt;br /&gt;
;TLB Coherence Strategy&lt;br /&gt;
: A process or system for maintaining the consistency of information between the processor TLBs and the PMT entries.&lt;br /&gt;
&lt;br /&gt;
;IDT (Interrupt Descriptor Table)&lt;br /&gt;
: A table, indexed by interrupt vector, that tells a processor where the handler for each interrupt is located. During an inter-processor interrupt it is used to locate the shootdown handler.&lt;br /&gt;
&lt;br /&gt;
;Interrupt&lt;br /&gt;
: A hardware or software based method for signalling to a processor that there is work for it.&lt;br /&gt;
&lt;br /&gt;
;ASID (Address Space Identifier)&lt;br /&gt;
: A number assigned to a process that is unique within the scope of a processor for a specified time period. A process may be assigned multiple ASIDs across time. Stored with the TLB entries, the ASID helps the kernel discard stale TLB entries.&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
&lt;br /&gt;
[http://sequoia.ict.pwr.wroc.pl/~iro/RISC/sm/www.hp.com/acd-18.html#HEADING18-0 Address Resolution and the TLB]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication, 2004&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin, 2008-2009&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Units, Milan Milenkovic, IEEE Micro, Vol. 10, Issue 2, 1990, pp. 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
#Page Table is maintained as a software construct (choose one answer)&lt;br /&gt;
::a) True&lt;br /&gt;
::b) False&lt;br /&gt;
#TLB keeps a mapping of ___ to ____&lt;br /&gt;
::a) Virtual Page to Physical Page&lt;br /&gt;
::b) Virtual Address to Physical Address&lt;br /&gt;
::c) Physical Page to Virtual Page&lt;br /&gt;
::d) Physical Address to Virtual Address&lt;br /&gt;
#Currently TLB coherence is most often achieved through&lt;br /&gt;
::a) Software solutions&lt;br /&gt;
::b) Hardware solutions&lt;br /&gt;
#UNITD is (choose one answer)&lt;br /&gt;
::a) A protocol for TLB coherence only&lt;br /&gt;
::b) A protocol for cache coherence only&lt;br /&gt;
::c) A protocol for cache and TLB coherence&lt;br /&gt;
#Which of the following approaches provides a solution for TLB coherence (choose at least one)&lt;br /&gt;
::a) Virtual Cache address&lt;br /&gt;
::b) Shootdown&lt;br /&gt;
::c) Invalidation&lt;br /&gt;
::d) Hardware solutions&lt;br /&gt;
#What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
::a) TLB active flag&lt;br /&gt;
::b) Hardware instructions&lt;br /&gt;
::c) Inter-Processor Interrupts&lt;br /&gt;
::d) PMT entries in main memory&lt;br /&gt;
#Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
::a) Increase in protection level&lt;br /&gt;
::b) Changes to virtual-physical address mappings&lt;br /&gt;
::c) Page table reference counts&lt;br /&gt;
::d) Presence bit&lt;br /&gt;
#An ASID refers to: (choose one answer)&lt;br /&gt;
::a) The unique ID of a process for its lifetime&lt;br /&gt;
::b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
::c) The unique ID of a PMT entry&lt;br /&gt;
::d) The unique ID of a page&lt;br /&gt;
#If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
::a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
::b) Between PMT entries in main memory and processor registers&lt;br /&gt;
::c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
::d) Between PMT entries in main memory and the data cache&lt;br /&gt;
#How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
::a) Via a shared memory semaphore&lt;br /&gt;
::b) Via the interrupt IDC&lt;br /&gt;
::c) Via a hardware instruction&lt;br /&gt;
::d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60591</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60591"/>
		<updated>2012-03-26T22:00:28Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessor architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that holds the virtual-to-physical mappings of the most recently used pages. The CPU uses the TLB for fast lookup of the physical page frame number for a virtual memory location; the same lookup through the process's page table (a software construct) is slower. A page may become invalid because it is swapped out of memory, or its protection level may change (read-only vs. read-write). Keeping the TLB entries for the same page coherent across processors as the page's state changes is called the TLB coherence problem. This coherence is most often maintained by software solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified, hardware-based cache-TLB coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space), that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to the page table is stored in a register, the '''Page-Table Base Register (PTBR)'''. Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), a system with 4-byte entries can address 2^32 x 2^12 = 2^44 bytes (16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access at the address given by the PTBR offset by the page number), obtain the frame number from the page-table entry, and only then access the required location in that frame. Two memory accesses are thus needed for every reference, so memory access is slowed down by a factor of 2.&lt;br /&gt;
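&lt;br /&gt;
The translation described above can be sketched in a few lines of Python. This is only an illustration, not an operating system implementation: the 4 KB page size, the dictionary standing in for the page table, and the function name translate are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
PAGE_SIZE = 4096  # assumed 4 KB pages (2**12 bytes)

# Hypothetical single-level page table: virtual page number to physical frame number
page_table = {0: 5, 1: 9, 2: 3}


def translate(virtual_address):
    """Split the address into (p, d) and look p up in the page table."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    frame_number = page_table[page_number]    # first memory access: the page table
    return frame_number * PAGE_SIZE + offset  # the second access would use this address


physical = translate(1 * PAGE_SIZE + 42)
print(physical)  # 9 * 4096 + 42 = 36906
```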
&lt;br /&gt;
[[File:TLB.png‎|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
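&lt;br /&gt;
The hit, miss and replacement behaviour described above can be modelled as follows. This is a minimal sketch under stated assumptions: the class name TLB, its capacity of two entries and the choice of LRU replacement are illustrative, and a real TLB compares all tags in parallel in hardware rather than using a dictionary.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class TLB:
    """Tiny fully associative TLB with LRU replacement (illustrative only)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # backing page table: page number to frame number
        self.entries = OrderedDict()   # cached translations, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number):
        if page_number in self.entries:
            self.hits += 1
            self.entries.move_to_end(page_number)  # mark as most recently used
            return self.entries[page_number]
        # TLB miss: consult the page table, then cache the translation
        self.misses += 1
        frame_number = self.page_table[page_number]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)       # evict the least recently used entry
        self.entries[page_number] = frame_number
        return frame_number


tlb = TLB(capacity=2, page_table={0: 5, 1: 9, 2: 3})
for page in [0, 1, 0, 2, 1]:
    tlb.lookup(page)
print(tlb.hits, tlb.misses)  # 1 4
```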
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame is read-only or read-write ('''protection level'''). Another bit, '''valid-invalid''', indicates whether the page is part of the address space of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch it. Simultaneous look-up of the L1 cache and the TLB makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be cached simultaneously in multiple TLBs. (This happens because data is shared across processors and because processes migrate from one processor to another.) Since the contents of these mappings can change in response to changes of state (of the mapping and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to become invalid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection-level changes for a mapping need to be observed by all processors. Likewise, a mapping modification (where the physical page mapped to a virtual address changes) needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
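&lt;br /&gt;
The distinction between safe and unsafe changes can be expressed as a small predicate. The sketch below is illustrative only: the Translation record and the function name is_unsafe_change are hypothetical, and real page-table entries carry more fields than are modelled here.&lt;br /&gt;
&lt;br /&gt;
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    """Hypothetical, simplified view of one address translation."""
    frame: int
    writable: bool
    valid: bool
    accessed: bool = False
    dirty: bool = False


def is_unsafe_change(old, new):
    """Return True if the change requires TLB coherence actions."""
    if old.frame != new.frame:
        return True   # a) mapping modification
    if old.writable and not new.writable:
        return True   # b) page privileges decreased
    if old.valid and not new.valid:
        return True   # c) translation marked invalid
    return False      # d), e): privilege increase or accessed/dirty update


read_only = Translation(frame=7, writable=False, valid=True)
read_write = Translation(frame=7, writable=True, valid=True)
print(is_unsafe_change(read_only, read_write))  # False: can be fixed lazily on a fault
print(is_unsafe_change(read_write, read_only))  # True: stale TLBs would still allow stores
```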
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it is not discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants differ in several respects: how many processors participate in the TLB update (the victim list), and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. Each receiving processor then clears its TLB active flag and enters a wait state after servicing the interrupt. The wait state may be implemented by attempting to acquire the lock held by the initiating processor on the page table or by polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
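&lt;br /&gt;
The six steps above can be traced with a small sequential simulation. This is a sketch under stated assumptions, not kernel code: the Processor class, the shootdown function and the direct method call standing in for an inter-processor interrupt are all hypothetical, and real victims would wait concurrently on the lock rather than in program order. In this sketch victims simply resume, and their TLB entries are refilled on later misses.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


class Processor:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}                   # virtual page to frame
        self.tlb_active = True
        self.interrupts_enabled = True

    def handle_shootdown_ipi(self, pages):
        # Step 4: disable interrupts, invalidate entries, clear the active flag, wait.
        self.interrupts_enabled = False
        for page in pages:
            self.tlb.pop(page, None)
        self.tlb_active = False

    def resume(self):
        # Step 6: leave the wait state and start using the TLB again.
        self.tlb_active = True
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, lock, updates):
    initiator.interrupts_enabled = False      # step 1: no local preemption
    initiator.tlb_active = False
    with lock:                                # step 2: lock the page table
        for cpu in victims:                   # step 3: interrupt every victim
            cpu.handle_shootdown_ipi(list(updates))
        page_table.update(updates)            # step 5: change the PMT entries
        for page, frame in updates.items():
            if page in initiator.tlb:
                initiator.tlb[page] = frame   # refresh the initiator's own TLB
        initiator.tlb_active = True           # set before the lock is released
    initiator.interrupts_enabled = True
    for cpu in victims:                       # step 6: victims exit the wait state
        cpu.resume()


page_table = {0: 5, 1: 9}
cpus = [Processor(i) for i in range(3)]
for cpu in cpus:
    cpu.tlb = dict(page_table)
shootdown(cpus[0], cpus[1:], page_table, threading.Lock(), {1: 12})
print(cpus[0].tlb, cpus[1].tlb)  # {0: 5, 1: 12} {0: 5}
```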
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that it is not guaranteed to remain unique to a process for the life of that process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’s ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so those entries can no longer be hit and are effectively invalidated.&lt;br /&gt;
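&lt;br /&gt;
The ASID bookkeeping described above can be sketched as follows. The class AsidManager, its method names and the two-processor setup are assumptions made for illustration; ASID exhaustion and wrap-around, which a real kernel must handle, are ignored here.&lt;br /&gt;
&lt;br /&gt;
```python
NUM_CPUS = 2


class AsidManager:
    """Per-process, per-processor ASID array (illustrative; 0 means a new ASID is needed)."""

    def __init__(self):
        self.asid_of = {}                 # pid to list of ASIDs, one per processor
        self.next_asid = [1] * NUM_CPUS   # next unused ASID on each processor

    def schedule(self, pid, cpu):
        """Return the ASID used when pid runs on cpu, allocating one if it is 0."""
        asids = self.asid_of.setdefault(pid, [0] * NUM_CPUS)
        if asids[cpu] == 0:
            asids[cpu] = self.next_asid[cpu]
            self.next_asid[cpu] += 1
        return asids[cpu]

    def pte_modified(self, pid, cpu):
        """pid, running on cpu, changed a PTE: zero its ASIDs on the other processors."""
        asids = self.asid_of[pid]
        for other in range(NUM_CPUS):
            if other != cpu:
                asids[other] = 0


manager = AsidManager()
tlb_cpu1 = {}                                # (asid, page) to frame on processor 1
old_asid = manager.schedule(pid=42, cpu=1)
tlb_cpu1[(old_asid, 3)] = 7                  # translation cached under the old ASID
manager.schedule(pid=42, cpu=0)
manager.pte_modified(pid=42, cpu=0)          # page 3 is remapped by processor 0
new_asid = manager.schedule(pid=42, cpu=1)
print(old_asid, new_asid, (new_asid, 3) in tlb_cpu1)  # 1 2 False
```
&lt;br /&gt;
The stale entry is still physically present in the TLB of processor 1, but because lookups now use the new ASID it can never match again.&lt;br /&gt;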
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are also used to partition the TLB on AMD64 processors with AMD-V, for example by the Xen hypervisor. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
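&lt;br /&gt;
A minimal sketch of the validation idea follows. The Memory and Cpu classes are hypothetical, only mapping changes are modelled (a permission restriction would bump the generation count in the same way), and in a real design the comparison would be performed by the memory system rather than by a Python method.&lt;br /&gt;
&lt;br /&gt;
```python
class Memory:
    """Page table whose frames carry a generation count (illustrative only)."""

    def __init__(self):
        self.page_table = {}   # page to frame
        self.generation = {}   # frame to generation count

    def _bump(self, frame):
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def map(self, page, frame):
        old_frame = self.page_table.get(page)
        if old_frame is not None:
            self._bump(old_frame)   # the old frame may still be cached in some TLB
        self.page_table[page] = frame
        self._bump(frame)

    def access(self, frame, generation_seen):
        """Grant the access only if the requester's generation count is current."""
        return self.generation[frame] == generation_seen


class Cpu:
    def __init__(self, memory):
        self.memory = memory
        self.tlb = {}               # page to (frame, generation)

    def refill(self, page):
        frame = self.memory.page_table[page]
        self.tlb[page] = (frame, self.memory.generation[frame])

    def read(self, page):
        if page not in self.tlb:
            self.refill(page)
        frame, generation = self.tlb[page]
        if not self.memory.access(frame, generation):
            self.refill(page)       # request denied: update the entry and retry
            frame, generation = self.tlb[page]
        return frame


memory = Memory()
memory.map(page=3, frame=7)
cpu = Cpu(memory)
print(cpu.read(3))            # 7
memory.map(page=3, frame=8)   # remap; no interrupt is sent to the CPU
print(cpu.read(3))            # 8
```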
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy &amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt;. It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. It improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software-based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty of a shootdown increases with the number of processors, as can be seen from Figure 3, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The effective latency is higher than shown in the graphs because TLB invalidations cost extra cycles to repopulate the TLB, either from the cache or from main memory depending on the case. The overhead of TLB shootdowns is higher for applications that invoke the routine more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
TLB entries are read-only. (Virtual-to-physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are therefore only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, so subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
To support coherence invalidations, UNITD adds a PTE CAM (PCAM) alongside each TLB: a content-addressable memory that holds the physical addresses of the PTEs whose translations are cached in the TLB. In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence so that a cached translation is tied to the memory block containing the PTE rather than to the PTE itself. Maintaining translation coherence at this coarser grain (i.e., cache block rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, we refer to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
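&lt;br /&gt;
The PCAM operations described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design from the UNITD paper: the class name TlbWithPcam, its method names and the slot-indexed lists are assumptions made for this sketch.&lt;br /&gt;

```python
class TlbWithPcam:
    """Toy model of a TLB paired with a PCAM (all names are hypothetical)."""

    def __init__(self, num_entries):
        # Slot i of the PCAM describes slot i of the TLB.
        self.tlb = [None] * num_entries      # (virtual page, frame) or None
        self.valid = [False] * num_entries   # Valid bit: True = Shared, False = Invalid
        self.pcam = [None] * num_entries     # block address of the PTE, or None

    def insert(self, index, vpn, frame, pte_block_addr):
        # A translation and its PTE block address are inserted together.
        self.tlb[index] = (vpn, frame)
        self.valid[index] = True
        self.pcam[index] = pte_block_addr

    def evict(self, index):
        # Replacing a translation also removes its PCAM entry.
        self.tlb[index] = None
        self.valid[index] = False
        self.pcam[index] = None

    def coherence_invalidate(self, block_addr):
        # Every PCAM entry matching the block address clears its Valid bit.
        hits = [i for i, addr in enumerate(self.pcam) if addr == block_addr]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits

    def lookup(self, vpn):
        # Regular TLB lookups never consult the PCAM.
        for i, entry in enumerate(self.tlb):
            if self.valid[i] and entry is not None and entry[0] == vpn:
                return entry[1]
        return None  # TLB miss


unit = TlbWithPcam(4)
unit.insert(0, vpn=7, frame=100, pte_block_addr=12)
unit.insert(1, vpn=8, frame=101, pte_block_addr=12)  # PTE in the same cache block
unit.insert(2, vpn=9, frame=102, pte_block_addr=40)
hits = unit.coherence_invalidate(12)  # invalidation of block address 12
```
In this model the invalidation of block address 12 hits two PCAM entries, so both translations miss on their next lookup, while the translation whose PTE lives in block 40 is unaffected.&lt;br /&gt;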
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB; this validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
UNITD is efficient in ensuring translation coherence. First, it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graph below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== Definitions ==&lt;br /&gt;
&lt;br /&gt;
;PMT(Page Mapping Table)&lt;br /&gt;
: This is an in-memory data structure that primarily helps the operating system map two address spaces to one another: virtual memory and physical memory. In addition to this mapping information, the PMT stores further data fields that track whether the virtual page has been loaded into physical memory, its permissions and its utilization metrics. Coupled with the TLB, the PMT is a critical component of virtual memory systems. Learn more about page tables [http://en.wikipedia.org/wiki/Page_table here].&lt;br /&gt;
&lt;br /&gt;
;PTE (Page Table Entry)&lt;br /&gt;
: A row in a page mapping table that relates a single page in virtual memory to a physical memory address or a page frame.&lt;br /&gt;
&lt;br /&gt;
;TLB(Translation Look-Aside Buffer)&lt;br /&gt;
: Associative, high-speed memory used to cache page table information and speed up virtual memory references from the processor.&lt;br /&gt;
&lt;br /&gt;
;Presence Bit&lt;br /&gt;
: One of the data elements in each PMT entry. Indicates whether the page is loaded in main memory (versus stored to disk).&lt;br /&gt;
&lt;br /&gt;
;Safe Change&lt;br /&gt;
: A type of update that can be made to a PMT safely without also updating the cached TLB counterpart.&lt;br /&gt;
&lt;br /&gt;
;TLB Coherence Strategy&lt;br /&gt;
: A process or system for maintaining the consistency of information between the processor TLBs and the PMT entries.&lt;br /&gt;
&lt;br /&gt;
;IDT (Interrupt Descriptor Table)&lt;br /&gt;
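: A vector-based data structure that maps each interrupt vector to the routine that handles it. During an inter-processor interrupt, the receiving processor uses it to locate the handler to run (for example, the shootdown routine).&lt;br /&gt;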
&lt;br /&gt;
;Interrupt&lt;br /&gt;
: A hardware or software based method for signalling to a processor that there is work for it.&lt;br /&gt;
&lt;br /&gt;
;ASID (Address Space Identifier)&lt;br /&gt;
: A number assigned to a process that is unique within the scope of a processor for a specified time period. A process may be assigned multiple ASIDs over time. Used with the TLB entries, the ASID helps discard stale TLB entries.&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
&lt;br /&gt;
[http://sequoia.ict.pwr.wroc.pl/~iro/RISC/sm/www.hp.com/acd-18.html#HEADING18-0 Address Resolution and the TLB]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication, 2004&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin, 2008-2009&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol10, Issue-2, 1990, Pages 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
#Page Table is maintained as a software construct (choose one answer)&lt;br /&gt;
&lt;br /&gt;
::a) True&lt;br /&gt;
::b) False&lt;br /&gt;
&lt;br /&gt;
#TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
::a) Virtual Page to Physical Page&lt;br /&gt;
::b) Virtual Address to Physical Address&lt;br /&gt;
::c) Physical Page to Virtual Page&lt;br /&gt;
::d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
#Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
::a) Software solutions&lt;br /&gt;
::b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
#UNITD is (choose one answer)&lt;br /&gt;
&lt;br /&gt;
::a) A protocol for TLB coherence only&lt;br /&gt;
::b) A protocol for cache coherence only&lt;br /&gt;
::c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
#Which of the following approaches provides a solution for TLB coherence (choose at least one)&lt;br /&gt;
&lt;br /&gt;
::a) Virtual Cache address&lt;br /&gt;
::b) Shootdown&lt;br /&gt;
::c) Invalidation&lt;br /&gt;
::d) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
#What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
&lt;br /&gt;
::a) TLB active flag&lt;br /&gt;
::b) Hardware instructions&lt;br /&gt;
::c) Inter-Processor Interrupts&lt;br /&gt;
::d) PMT entries in main memory&lt;br /&gt;
&lt;br /&gt;
#Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
&lt;br /&gt;
::a) Increase in protection level&lt;br /&gt;
::b) Changes to virtual-physical address mappings&lt;br /&gt;
::c) Page table reference counts&lt;br /&gt;
::d) Presence bit&lt;br /&gt;
&lt;br /&gt;
#An ASID refers to: (choose one answer)&lt;br /&gt;
&lt;br /&gt;
::a) The unique ID of a process for its lifetime&lt;br /&gt;
::b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
::c) The unique ID of a PMT entry&lt;br /&gt;
::d) The unique ID of a page&lt;br /&gt;
&lt;br /&gt;
#If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
&lt;br /&gt;
::a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
::b) Between PMT entries in main memory and processor registers&lt;br /&gt;
::c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
::d) Between PMT entries in main memory and the data cache&lt;br /&gt;
&lt;br /&gt;
#How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
&lt;br /&gt;
::a) Via a shared memory semaphore&lt;br /&gt;
::b) Via the interrupt IDC&lt;br /&gt;
::c) Via a hardware instruction&lt;br /&gt;
::d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60590</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60590"/>
		<updated>2012-03-26T21:54:34Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessing architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that maintains the virtual-to-physical mappings of the most recently used pages. The CPU uses the TLB for fast lookup of the physical page frame number for a virtual memory location. The same lookup from the process's page table (a software-based construct) is slower. A page may become invalid because it is swapped out to backing storage, or its protection level may change (read-only vs. read-write). Maintaining coherence between TLB entries on different processors for the same page (as its state changes) is called the TLB coherence problem. This coherence is most often maintained by software-based solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified cache-TLB coherence protocol (that is hardware-based).&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (one single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For a process, a pointer to the page table is stored in a register (the '''Page-Table Base Register, PTBR'''). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
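&lt;br /&gt;
The arithmetic in the parenthetical note above can be checked with a short Python calculation. It is a sketch that assumes, as the note does, that all 32 bits of an entry are available as a frame number; the variable names are chosen here for illustration.&lt;br /&gt;

```python
ENTRY_BITS = 32          # a 4-byte page-table entry
FRAME_SIZE = 4 * 1024    # 4 KB frames, i.e. 2**12 bytes

frame_count = 2 ** ENTRY_BITS                  # distinct frame numbers an entry can hold
addressable_bytes = frame_count * FRAME_SIZE   # 2**32 * 2**12 = 2**44
addressable_tib = addressable_bytes // 2 ** 40 # expressed in units of 2**40 bytes

print(addressable_bytes == 2 ** 44, addressable_tib)
```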
&lt;br /&gt;
Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (located through the PTBR, which costs one memory access), read the frame number from the page table entry, and only then access the required location in that frame. Two memory accesses are thus needed for every reference, so memory access is slowed down by a factor of 2.&lt;br /&gt;
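&lt;br /&gt;
The translation just described can be sketched as follows. The page size, the contents of page_table and the function name translate are assumptions made for this example only.&lt;br /&gt;

```python
PAGE_SIZE = 4096  # assumed 4 KB pages

# Hypothetical page table: index = page number, value = frame number.
page_table = [5, 9, 2, 7]

def translate(virtual_address):
    """Translate a virtual address using a single-level page table."""
    # Split the address into page number p and page offset d.
    p, d = divmod(virtual_address, PAGE_SIZE)
    frame = page_table[p]          # first memory access: read the page table
    return frame * PAGE_SIZE + d   # address used for the second memory access

physical = translate(1 * PAGE_SIZE + 123)  # page 1, offset 123
```
Page 1 maps to frame 9 in this table, so the result is the base of frame 9 plus the unchanged offset.&lt;br /&gt;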
&lt;br /&gt;
[[File:TLB.png‎|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
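&lt;br /&gt;
The hit and miss paths described above can be illustrated with a small software model. This is a sketch that assumes LRU replacement (one of the policies mentioned above); the class name Tlb and its fields are hypothetical.&lt;br /&gt;

```python
from collections import OrderedDict

class Tlb:
    """Tiny fully associative TLB with LRU replacement (illustrative only)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # dict: page number to frame number
        self.entries = OrderedDict()   # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)    # mark as most recently used
            return self.entries[page]
        self.misses += 1
        frame = self.page_table[page]         # TLB miss: consult the page table
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        self.entries[page] = frame            # cache the translation for next time
        return frame

tlb = Tlb(capacity=2, page_table={0: 5, 1: 9, 2: 2})
frames = [tlb.lookup(p) for p in (0, 1, 0, 2, 1)]
```
With a capacity of two entries, only the second reference to page 0 hits; page 1 is evicted when page 2 is loaded and therefore misses again on its second reference.&lt;br /&gt;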
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of the virtual page to a physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that L1 caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch it. Performing the two look-ups simultaneously makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be cached in multiple TLBs simultaneously. (This happens because data is shared across processors and because processes migrate from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to become invalid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), that mapping must be flushed from all TLBs and must not be used. Further, protection-level changes for a mapping must be observed by all processors. Finally, a mapping modification (where the physical page for a virtual address changes) must be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
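&lt;br /&gt;
The lazy update in the read-only to read-write example above can be modeled as follows. This is a simplified sketch: the dictionaries standing in for the PTE and the TLB entry, and the function name store, are assumptions made for illustration.&lt;br /&gt;

```python
# Shared page table entry and the copy cached in core 0's TLB.
page_table_entry = {"vpn": 3, "frame": 8, "writable": False}
core0_tlb_entry = dict(page_table_entry)

# Another core raises the privileges in the page table only (a safe change).
page_table_entry["writable"] = True

def store(tlb_entry, pte):
    """Attempt a store through a TLB entry; return the number of faults taken."""
    faults = 0
    if not tlb_entry["writable"]:
        faults += 1              # access violation on the stale read-only entry
        tlb_entry.update(pte)    # fault handler reloads the updated translation
    if not tlb_entry["writable"]:
        raise PermissionError("page is read-only")
    return faults

faults_first = store(core0_tlb_entry, page_table_entry)   # takes one fault
faults_second = store(core0_tlb_entry, page_table_entry)  # entry is now up to date
```
The stale entry costs one extra fault but never permits an access that the page table forbids, which is why this direction of change is considered safe.&lt;br /&gt;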
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. Its variants differ in several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
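&lt;br /&gt;
The six steps above can be walked through with a small sequential model. This is a sketch under strong simplifying assumptions: the inter-processor interrupt is modeled as a direct function call, the victims do not really run concurrently, and they refill their TLBs lazily on the next miss. The names Cpu and shootdown are hypothetical.&lt;br /&gt;

```python
import threading

class Cpu:
    """Minimal stand-in for a processor with a private TLB."""

    def __init__(self, name):
        self.name = name
        self.tlb = {}                   # virtual page to frame
        self.tlb_active = True
        self.interrupts_enabled = True

def shootdown(initiator, victims, page_table, lock, vpn, new_frame):
    log = []
    # Step 1: the initiator disables interrupts and clears its active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: the initiator locks the page table.
    with lock:
        # Steps 3 and 4: the IPI is modeled as a direct call; each victim
        # invalidates the entry, clears its active flag and waits.
        for cpu in victims:
            cpu.interrupts_enabled = False
            cpu.tlb.pop(vpn, None)
            cpu.tlb_active = False
            log.append(cpu.name + " invalidated")
        # Step 5: update the page table and the initiator's own TLB,
        # then set the active flag before the lock is released.
        page_table[vpn] = new_frame
        initiator.tlb[vpn] = new_frame
        initiator.tlb_active = True
        log.append(initiator.name + " updated page table")
    initiator.interrupts_enabled = True
    # Step 6: victims leave the wait state and resume using their TLBs.
    for cpu in victims:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True
    return log

page_table = {4: 20}
cpus = [Cpu("cpu0"), Cpu("cpu1"), Cpu("cpu2")]
for cpu in cpus:
    cpu.tlb[4] = 20   # every processor caches the old translation
log = shootdown(cpus[0], cpus[1:], page_table, threading.Lock(), vpn=4, new_frame=33)
```
After the call, no processor holds the old translation for page 4: the initiator caches the new frame and the victims will reload it from the page table on their next access.&lt;br /&gt;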
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the ASIDs of the stale TLB entries on that processor, so those entries can no longer be used and are effectively invalidated.&lt;br /&gt;
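&lt;br /&gt;
The lazy ASID mechanism described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the IRIX implementation: the class Processor, the function pte_modified and the dictionary-based TLB are hypothetical names chosen for the example, and ASID wrap-around is ignored.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of lazy ASID-based invalidation (all names are illustrative).
NO_ASID = 0  # reserved value meaning 'must be reassigned'


class Processor:
    def __init__(self):
        self.next_asid = 1
        self.asid_of = {}   # pid to ASID on this processor (the separate array)
        self.tlb = {}       # (asid, virtual_page) to frame

    def asid_for(self, pid):
        # Assign a fresh ASID when the process has none or was zeroed.
        if self.asid_of.get(pid, NO_ASID) == NO_ASID:
            self.asid_of[pid] = self.next_asid
            self.next_asid += 1
        return self.asid_of[pid]

    def lookup(self, pid, vpage):
        # Returns the cached frame, or None on a TLB miss.
        return self.tlb.get((self.asid_for(pid), vpage))

    def fill(self, pid, vpage, frame):
        self.tlb[(self.asid_for(pid), vpage)] = frame


def pte_modified(pid, other_processors):
    # Kernel action: zero the ASID of the process in each remote mapping array.
    # The TLB entries themselves are left untouched.
    for cpu in other_processors:
        cpu.asid_of[pid] = NO_ASID
```
&lt;br /&gt;
After pte_modified runs, the stale entry is still physically present in the TLB, but the next lookup for that process uses a new ASID and therefore misses.&lt;br /&gt;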
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are also used on AMD64 processors, for example by the Xen hypervisor with AMD-V (SVM). &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
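&lt;br /&gt;
A minimal Python sketch of the generation-count check follows. It assumes a simple software model in which the memory side keeps one counter per frame; the classes ValidatingMemory and Tlb and their methods are hypothetical and only illustrate the idea.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical model of validation-based TLB consistency (names are illustrative).
class ValidatingMemory:
    def __init__(self):
        self.generation = {}   # frame number to current generation count

    def bump(self, frame):
        # Called by the memory manager on a remap or a permission restriction.
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def check(self, frame, seen):
        # True when the generation count held by the requester is still current.
        return self.generation.get(frame, 0) == seen


class Tlb:
    def __init__(self, memory, page_table):
        self.memory = memory
        self.page_table = page_table   # virtual page to frame
        self.entries = {}              # virtual page to (frame, generation)
        self.refills = 0

    def translate(self, vpage):
        if vpage in self.entries:
            frame, seen = self.entries[vpage]
            if self.memory.check(frame, seen):
                return frame
        # Miss or stale entry: reload the translation from the page table.
        frame = self.page_table[vpage]
        self.entries[vpage] = (frame, self.memory.generation.get(frame, 0))
        self.refills += 1
        return frame
```
&lt;br /&gt;
No interrupt is sent when the mapping changes; the stale entry is only detected, and replaced, when the processor next uses it.&lt;br /&gt;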
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy.&amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt; It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach for TLB coherence and significantly reduces the performance penalty due to TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are all-hardware based because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from Figure 3, which shows the latency of the processor issuing the shootdown and of the ones receiving it. The effective latency is higher than shown in the graphs because TLB invalidations result in extra cycles spent repopulating the TLB, either from the cache or from main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual to physical address mappings are never modified in the TLBs themselves but rather in the page table entries). There are only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid and thus subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
To integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD adds a PTE CAM (PCAM), a content-addressable memory alongside the TLB that holds the physical address of the PTE behind each cached translation. UNITD also relaxes the correspondence so that a translation is tied to the memory block containing the PTE rather than to the PTE itself. Maintaining translation granularity at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same datum. For simplicity, PCAM entries are referred to simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
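&lt;br /&gt;
The two PCAM operations can be modelled in software as follows. This is only a sketch of the behaviour described above, not the hardware design from the paper: the class UnitdTlb, its method names and the 64-byte cache block size are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical software model of a TLB with a PCAM (names and sizes are illustrative).
class UnitdTlb:
    def __init__(self, size, block_bytes=64):
        self.block_bytes = block_bytes
        self.valid = [False] * size          # Valid bit: Shared when True
        self.translation = [None] * size     # (virtual page, frame)
        self.pcam = [None] * size            # PTE block address, same index

    def insert(self, index, vpage, frame, pte_addr):
        # The PTE address goes into the PCAM at the same index as the translation.
        self.translation[index] = (vpage, frame)
        self.pcam[index] = pte_addr // self.block_bytes
        self.valid[index] = True

    def coherence_invalidate(self, addr):
        # A coherence request may hit several PCAM entries in the same block.
        block = addr // self.block_bytes
        hits = [i for i, b in enumerate(self.pcam) if b == block]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits

    def lookup(self, vpage):
        # Regular TLB lookup; the PCAM is not consulted here.
        for i, entry in enumerate(self.translation):
            if self.valid[i] and entry[0] == vpage:
                return entry[1]
        return None
```
&lt;br /&gt;
Because invalidation works at cache-block granularity, a write to one PTE also invalidates translations whose PTEs merely share that block, which is the small performance penalty mentioned earlier.&lt;br /&gt;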
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
UNITD is efficient in ensuring translation coherence. First, it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to Figure 5 below).&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== Definitions ==&lt;br /&gt;
&lt;br /&gt;
;PMT (Page Mapping Table)&lt;br /&gt;
: This is an in-memory data structure that primarily helps the operating system map two address spaces to one another: virtual memory and physical memory. In addition to this mapping information, the PMT also stores additional data fields to keep track of whether the virtual memory has been loaded in physical memory, permissions and utilization metrics. Coupled with the TLB, the PMT is a critical component of virtual memory systems. Learn more about page tables [http://en.wikipedia.org/wiki/Page_table here].&lt;br /&gt;
&lt;br /&gt;
;PTE (Page Table Entry)&lt;br /&gt;
: A row in a page mapping table that relates a single page in virtual memory to a physical memory address or a page frame.&lt;br /&gt;
&lt;br /&gt;
;TLB (Translation Look-Aside Buffer)&lt;br /&gt;
: Associative, high-speed memory used to cache page table information and speed up virtual memory references from the processor.&lt;br /&gt;
&lt;br /&gt;
;Presence Bit&lt;br /&gt;
: One of the data elements in each PMT entry. Indicates whether the page is loaded in main memory (versus stored to disk).&lt;br /&gt;
&lt;br /&gt;
;Safe Change&lt;br /&gt;
: A type of update that can be made to a PMT safely without also updating the cached TLB counterpart.&lt;br /&gt;
&lt;br /&gt;
;TLB Coherence Strategy&lt;br /&gt;
: A process or system for maintaining the consistency of information between the processor TLBs and the PMT entries.&lt;br /&gt;
&lt;br /&gt;
;IDT (Interrupt Descriptor Table)&lt;br /&gt;
: A table, indexed by interrupt vector, that tells a processor which handler routine to run when an interrupt (including an inter-processor interrupt) arrives.&lt;br /&gt;
&lt;br /&gt;
;Interrupt&lt;br /&gt;
: A hardware or software based method for signalling to a processor that there is work for it.&lt;br /&gt;
&lt;br /&gt;
;ASID (Address Space Identifier)&lt;br /&gt;
: A number assigned to a process that is unique within the scope of a processor for a specified time period. A process may be assigned multiple ASIDs across time. Used with the TLB entries, the ASID helps discard stale TLB entries.&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
&lt;br /&gt;
[http://sequoia.ict.pwr.wroc.pl/~iro/RISC/sm/www.hp.com/acd-18.html#HEADING18-0 Address Resolution and the TLB]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication, 2004&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin, 2008-2009&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol10, Issue-2, 1990, Pages 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence only&lt;br /&gt;
b) A protocol for cache coherence only&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following approaches provides a solution for TLB coherence (choose at least one)&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
6. What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) TLB active flag&lt;br /&gt;
b) Hardware instructions&lt;br /&gt;
c) Inter-Processor Interrupts&lt;br /&gt;
d) PMT entries in main memory&lt;br /&gt;
&lt;br /&gt;
7. Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
&lt;br /&gt;
a) Increase in protection level&lt;br /&gt;
b) Changes to virtual-physical address mappings&lt;br /&gt;
c) Page table reference counts&lt;br /&gt;
d) Presence bit&lt;br /&gt;
&lt;br /&gt;
8. An ASID refers to: (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) The unique ID of a process for its lifetime&lt;br /&gt;
b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
c) The unique ID of a PMT entry&lt;br /&gt;
d) The unique ID of a page&lt;br /&gt;
&lt;br /&gt;
9. If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
b) Between PMT entries in main memory and processor registers&lt;br /&gt;
c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
d) Between PMT entries in main memory and the data cache&lt;br /&gt;
&lt;br /&gt;
10. How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) Via a shared memory semaphore&lt;br /&gt;
b) Via the interrupt IDC&lt;br /&gt;
c) Via a hardware instruction&lt;br /&gt;
d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60589</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60589"/>
		<updated>2012-03-26T21:45:53Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessing architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that maintains the virtual-to-physical mappings of the most recently used pages. The TLB is used by the CPU for fast look-up of the physical page frame number for a virtual memory location. The same look-up from the page table for the process (a software-based construct) is slower. A page may become invalid due to a swap-out to disk or may change protection level (read-only versus read-write). Maintaining coherence between TLB entries on different processors for the same page (as its state changes) is called the TLB coherence problem. This coherence is most often maintained by software-based solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified cache-TLB coherence protocol (that is hardware based).&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of physical memory. Virtual memory operates by executing programs only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to its page table is stored in a register, the '''Page-Table Base Register (PTBR)'''. Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory).&lt;br /&gt;
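&lt;br /&gt;
The arithmetic behind the 16 TB figure can be checked with a short Python calculation. The constant names are chosen only for this illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Worked arithmetic for the addressable physical memory (names are illustrative).
ENTRY_BITS = 32                 # a 4-byte page-table entry holds a 32-bit frame number
FRAME_BYTES = 4 * 1024          # 4 KB frames, i.e. 2**12 bytes

frames = 2 ** ENTRY_BITS                      # number of distinct frames
addressable_bytes = frames * FRAME_BYTES      # 2**32 * 2**12 = 2**44
addressable_tb = addressable_bytes // 2 ** 40
print(addressable_tb, 'TB')
```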
&lt;br /&gt;
Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (located through the PTBR, which requires a memory access), read the frame number from the page-table entry, and only then access the desired location in that frame. Two memory accesses are needed for each reference, so memory access is slowed down by a factor of 2.&lt;br /&gt;
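&lt;br /&gt;
The split into page number and offset, and the recombination with the frame base, can be sketched as follows. The 4 KB page size and the dictionary used as a page table are assumptions for the example; split and translate are hypothetical helper names.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of paged address translation (4 KB pages assumed).
PAGE_BYTES = 4096


def split(vaddr):
    # Returns (page number p, page offset d).
    return divmod(vaddr, PAGE_BYTES)


def translate(vaddr, page_table):
    page, offset = split(vaddr)
    frame = page_table[page]            # the extra memory access in a real system
    return frame * PAGE_BYTES + offset  # frame base address combined with offset
```
&lt;br /&gt;
For instance, with page 0x12 mapped to frame 0x7, virtual address 0x12345 translates to physical address 0x7345.&lt;br /&gt;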
&lt;br /&gt;
[[File:TLB.png‎|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast look-up hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
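&lt;br /&gt;
The hit, miss and replacement behaviour described above can be illustrated with a small Python model. It assumes an LRU replacement policy and a dictionary standing in for the page table; the class LruTlb and its members are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical model of a small TLB with LRU replacement (names are illustrative).
from collections import OrderedDict


class LruTlb:
    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table      # page number to frame number
        self.entries = OrderedDict()      # oldest entry first
        self.misses = 0

    def frame_for(self, page):
        if page in self.entries:
            # TLB hit: mark the entry as most recently used.
            self.entries.move_to_end(page)
            return self.entries[page]
        # TLB miss: consult the page table, then cache the translation.
        self.misses += 1
        frame = self.page_table[page]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used
        self.entries[page] = frame
        return frame
```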
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the page is read-only or read-write (the '''protection level'''). Another bit, '''valid-invalid''', indicates whether the page belongs to the address space of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that L1 caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; scheme that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch it. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another). Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table). If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), that mapping must be flushed from all TLBs and must not be used. Further, protection-level changes for a TLB mapping need to be observed by all processors. Likewise, a mapping modification for a TLB entry (where the physical mapping changes for a virtual address) needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not maintain TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
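&lt;br /&gt;
The safe/unsafe distinction discussed above can be summarised as a small Python predicate. This is a simplified sketch that models a PMT entry as a dictionary with only three fields (frame, writable, present); the function name is_unsafe_change is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical classifier for PMT updates (simplified three-field entry assumed).
def is_unsafe_change(old, new):
    '''Return True when a PMT update needs a TLB coherence action.

    old and new are dicts with the keys frame, writable and present.
    '''
    if old['present'] and not new['present']:
        return True     # page swapped out: cached mapping is no longer valid
    if old['present'] and old['frame'] != new['frame']:
        return True     # virtual page remapped to a different frame
    if old['writable'] and not new['writable']:
        return True     # protection increased (read-write to read-only)
    return False        # e.g. protection reduced or presence bit set
```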
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors, including how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor first disables its local interrupts, preventing another thread from preempting it and modifying its TLB. It also clears its TLB active flag, which indicates whether a processor is currently using its TLB.&lt;br /&gt;
# The initiating processor locks the page table it is modifying, or potentially some smaller portion of it. This prevents other processors from modifying the same page data.&lt;br /&gt;
# The initiating processor places an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request with its interrupt controller. The interrupt is dispatched through a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program, a software routine that invalidates TLB entries; the request also identifies the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# Each recipient of the interrupt responds by disabling its local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. It then clears its TLB active flag and enters a wait state after servicing the interrupt. The wait state may be implemented by attempting to acquire the lock held by the initiating processor on the page table or by polling shared memory.&lt;br /&gt;
# Once all recipients have responded, the initiating processor makes its changes to the PMTs and updates its own TLB. Before releasing the page table lock, it sets its TLB active flag.&lt;br /&gt;
# The receiving processors exit their wait state, update their TLBs and set their TLB active flags.&lt;br /&gt;
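&lt;br /&gt;
The six steps above can be sketched sequentially in Python. This is only an illustration under stated assumptions: the names Processor, shootdown and the dictionary-based page table are hypothetical, interrupts and locks are modelled as plain flags, and the victims refill their TLBs lazily on the next miss instead of updating them in step 6.&lt;br /&gt;
&lt;br /&gt;
```python
# Sequential sketch of the shootdown steps above. All names are hypothetical;
# a real kernel runs these steps concurrently using interrupts and locks.

class Processor:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}                    # virtual page -> physical frame
        self.tlb_active = True           # True while the TLB may be used
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, vpage, new_frame):
    """Update one PMT entry and keep every TLB consistent with it."""
    log = []

    # Step 1: the initiator disables interrupts and clears its active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    log.append("initiator quiesced")

    # Step 2: the initiator locks the page table (modelled as a flag).
    page_table["locked"] = True

    # Steps 3-4: each victim takes the IPI, invalidates the entry,
    # clears its active flag and waits.
    for cpu in victims:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(vpage, None)
        cpu.tlb_active = False
        log.append(f"cpu {cpu.cpu_id} invalidated page {vpage}")

    # Step 5: the initiator updates the PMT and its own TLB, then unlocks.
    page_table["entries"][vpage] = new_frame
    initiator.tlb[vpage] = new_frame
    initiator.tlb_active = True
    initiator.interrupts_enabled = True
    page_table["locked"] = False
    log.append("page table updated")

    # Step 6: the victims leave the wait state and resume using their TLBs.
    for cpu in victims:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True
    return log


page_table = {"locked": False, "entries": {7: 100}}
cpus = [Processor(i) for i in range(3)]
for cpu in cpus:
    cpu.tlb[7] = 100                     # every TLB caches the old mapping

events = shootdown(cpus[0], cpus[1:], page_table, vpage=7, new_frame=200)
```
&lt;br /&gt;
After the call, no victim TLB holds the old mapping for page 7, and the initiator's TLB and the page table agree on the new frame.&lt;br /&gt;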
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of, and potentially shared by, multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system, which has been ported to the BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processor architectures include a TLB invalidate instruction in their instruction set. The operating system kernel invokes the instruction to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus, and a snooper on each processor detects it and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that it is not guaranteed to stay the same for the life of a process. Each process is assigned a unique ASID on each processor; hence, a PID may map to many ASIDs. A separate array keeps track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to its ASIDs.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns the process a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so those entries are effectively invalidated: the next lookup misses in the TLB and the translation is reloaded.&lt;br /&gt;
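&lt;br /&gt;
The ASID mechanism can be illustrated with a small Python model. It is a sketch under assumptions, not the MIPS or IRIX implementation: the Cpu class, the asid_of dictionary (standing in for the per-processor mapping array) and modify_pte are hypothetical names, and ASID exhaustion and wrap-around are not modelled.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical model of ASID-based lazy invalidation.

class Cpu:
    def __init__(self):
        self.next_asid = 1
        self.asid_of = {}   # pid -> ASID on this CPU (0 means "stale")
        self.tlb = {}       # (asid, virtual page) -> physical frame

    def asid_for(self, pid):
        """Return the process's ASID, allocating a fresh one if it is 0."""
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = self.next_asid
            self.next_asid += 1
        return self.asid_of[pid]

    def translate(self, pid, vpage, page_table):
        key = (self.asid_for(pid), vpage)
        if key not in self.tlb:              # TLB miss: read the page table
            self.tlb[key] = page_table[vpage]
        return self.tlb[key]


def modify_pte(page_table, vpage, frame, pid, other_cpus):
    """Change a PTE and zero the process's ASID on the other processors."""
    page_table[vpage] = frame
    for cpu in other_cpus:
        cpu.asid_of[pid] = 0    # the TLB entries themselves are untouched


page_table = {3: 40}
cpu_a, cpu_b = Cpu(), Cpu()
before = cpu_b.translate(pid=9, vpage=3, page_table=page_table)
modify_pte(page_table, vpage=3, frame=41, pid=9, other_cpus=[cpu_b])
after = cpu_b.translate(pid=9, vpage=3, page_table=page_table)
```
&lt;br /&gt;
The stale entry tagged with the old ASID is still present in cpu_b's TLB after the update, but it can no longer match because the process now runs under a new ASID.&lt;br /&gt;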
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are likewise used for TLB management by the Xen hypervisor on AMD64 (AMD-V) processors. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
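&lt;br /&gt;
A minimal Python sketch of the generation-count check follows. The Memory and Tlb classes are hypothetical, and the sketch assumes that remapping a virtual page increments the count of both the old and the new frame; permission changes are not modelled.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of validation-based TLB consistency.

class Memory:
    """Holds the page table plus a generation count per frame."""

    def __init__(self):
        self.page_table = {}   # virtual page -> physical frame
        self.generation = {}   # physical frame -> generation count

    def map_page(self, vpage, frame):
        # A mapping change bumps the count of the old and the new frame.
        old = self.page_table.get(vpage)
        self.page_table[vpage] = frame
        for f in (old, frame):
            if f is not None:
                self.generation[f] = self.generation.get(f, 0) + 1

    def access(self, frame, generation):
        """Accept the request only if the caller's generation is current."""
        return self.generation.get(frame) == generation


class Tlb:
    def __init__(self, memory):
        self.memory = memory
        self.entries = {}      # virtual page -> (frame, generation)
        self.refills = 0

    def load(self, vpage):
        frame = self.memory.page_table[vpage]
        self.entries[vpage] = (frame, self.memory.generation[frame])

    def read(self, vpage):
        if vpage not in self.entries:
            self.load(vpage)
        if not self.memory.access(*self.entries[vpage]):
            # Request denied: refresh the stale entry and retry once.
            self.refills += 1
            self.load(vpage)
            assert self.memory.access(*self.entries[vpage])
        return self.entries[vpage][0]


memory = Memory()
memory.map_page(vpage=5, frame=20)
tlb = Tlb(memory)
first = tlb.read(5)                  # entry cached with frame 20's count
memory.map_page(vpage=5, frame=21)   # remap without touching the TLB
second = tlb.read(5)                 # stale entry is rejected and refreshed
```
&lt;br /&gt;
No interrupt is sent when the mapping changes; the stale entry is detected only when the processor next uses it.&lt;br /&gt;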
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy. &amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt;&lt;br /&gt;
It provides an example of a hardware based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. The TLBs participate in cache coherence updates without requiring any change to that protocol. UNITD improves on the software based shootdown approach and significantly reduces the performance penalty due to TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware based TLB coherence can a) improve TLB coherence performance significantly for a large number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for shootdown increases with the number of processors, as can be seen in Figure 3, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The effective latency is higher than shown in the graphs because TLB invalidations cause extra cycles to be spent repopulating the TLB, either from the cache or from main memory depending on the case. The overhead from TLB shootdowns is higher for applications that trigger shootdowns more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual to physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are therefore only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
To integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD adds to each TLB a content-addressable memory, the PCAM (PTE CAM), that records the physical address of the PTE behind each cached translation. UNITD also relaxes the correspondence of the translations to the memory block containing the PTE, rather than the PTE itself. Maintaining translation coherence at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same datum. For simplicity, PCAM entries are referred to simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
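&lt;br /&gt;
The interaction between the TLB, its Valid bits and the PCAM can be sketched as follows. This is an illustrative software model, not the hardware design from the paper: the UnitdTlb class, the 64-byte block size and the example addresses are assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical model of a TLB with a PCAM, following the description above.

BLOCK_SIZE = 64   # assumed cache block size in bytes


class UnitdTlb:
    def __init__(self, size):
        self.vpage = [None] * size    # TLB tags
        self.frame = [None] * size    # TLB data
        self.valid = [False] * size   # Valid bit: True = Shared, False = Invalid
        self.pcam = [None] * size     # block address of the PTE, same index

    def insert(self, index, vpage, frame, pte_address):
        """Fill a TLB entry and the PCAM entry at the same index."""
        self.vpage[index] = vpage
        self.frame[index] = frame
        self.valid[index] = True
        self.pcam[index] = pte_address // BLOCK_SIZE

    def lookup(self, vpage):
        """Regular TLB lookup; the PCAM is not consulted."""
        for i, tag in enumerate(self.vpage):
            if self.valid[i] and tag == vpage:
                return self.frame[i]
        return None                    # TLB miss

    def coherence_invalidate(self, address):
        """Clear the Valid bit of every entry whose PTE is in the block."""
        block = address // BLOCK_SIZE
        hits = [i for i, b in enumerate(self.pcam) if b == block]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits


tlb = UnitdTlb(size=4)
tlb.insert(0, vpage=1, frame=10, pte_address=0x1000)
tlb.insert(1, vpage=2, frame=11, pte_address=0x1008)   # same 64-byte block
tlb.insert(2, vpage=3, frame=12, pte_address=0x2000)
hits = tlb.coherence_invalidate(0x1000)
```
&lt;br /&gt;
A single invalidation for the block at 0x1000 hits two PCAM entries, so both translations become Invalid even if only one of the two PTEs actually changed. This is the small performance penalty accepted in exchange for block-granularity coherence.&lt;br /&gt;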
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. (The ideal-invalidation system uses the same modified OS as UNITD, i.e., with no TLB shootdown support, and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.)&lt;br /&gt;
&lt;br /&gt;
Three results stand out. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graph below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== Definitions ==&lt;br /&gt;
&lt;br /&gt;
;PMT (Page Mapping Table)&lt;br /&gt;
: This is an in-memory data structure that primarily helps the operating system map two address spaces to one another: virtual memory and physical memory. In addition to this mapping information, the PMT stores data fields that keep track of whether the virtual memory has been loaded in physical memory, permissions and utilization metrics. Coupled with the TLB, the PMT is a critical component of virtual memory systems. Learn more about page tables [http://en.wikipedia.org/wiki/Page_table here].&lt;br /&gt;
&lt;br /&gt;
;PTE (Page Table Entry)&lt;br /&gt;
: A row in a page mapping table that relates a single page in virtual memory to a physical memory address or a page frame.&lt;br /&gt;
&lt;br /&gt;
;TLB (Translation Look-Aside Buffer)&lt;br /&gt;
: Associative, high-speed memory used to cache page table information and speed up virtual memory references from the processor.&lt;br /&gt;
&lt;br /&gt;
;Presence Bit&lt;br /&gt;
: One of the data elements in each PMT entry. Indicates whether the page is loaded in main memory (versus stored to disk).&lt;br /&gt;
&lt;br /&gt;
;Safe Change&lt;br /&gt;
: A type of update that can be made to a PMT safely without also updating the cached TLB counterpart.&lt;br /&gt;
&lt;br /&gt;
;TLB Coherence Strategy&lt;br /&gt;
: A process or system for maintaining the consistency of information between the processor TLBs and the PMT entries.&lt;br /&gt;
&lt;br /&gt;
;IDT (Interrupt Descriptor Table)&lt;br /&gt;
: A table that maps each interrupt vector to the location of its handler. In a TLB shootdown it is used to locate the shootdown program when an inter-processor interrupt arrives.&lt;br /&gt;
&lt;br /&gt;
;Interrupt&lt;br /&gt;
: A hardware or software based method for signalling to a processor that there is work for it.&lt;br /&gt;
&lt;br /&gt;
;ASID (Address Space Identifier)&lt;br /&gt;
: A number assigned to a process that is unique within the scope of a processor for a specified time period. A process may be assigned multiple ASIDs across time. Used with the TLB entries, the ASID helps discard stale TLB entries.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol10, Issue-2, 1990, Pages 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence only&lt;br /&gt;
b) A protocol for cache coherence only&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following approaches provides a solution for TLB coherence (choose at least one)&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
6. What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) TLB active flag&lt;br /&gt;
b) Hardware instructions&lt;br /&gt;
c) Inter-Processor Interrupts&lt;br /&gt;
d) PMT entries in main memory&lt;br /&gt;
&lt;br /&gt;
7. Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
&lt;br /&gt;
a) Increase in protection level&lt;br /&gt;
b) Changes to virtual-physical address mappings&lt;br /&gt;
c) Page table reference counts&lt;br /&gt;
d) Presence bit&lt;br /&gt;
&lt;br /&gt;
8. An ASID refers to: (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) The unique ID of a process for its lifetime&lt;br /&gt;
b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
c) The unique ID of a PMT entry&lt;br /&gt;
d) The unique ID of a page&lt;br /&gt;
&lt;br /&gt;
9. If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
b) Between PMT entries in main memory and processor registers&lt;br /&gt;
c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
d) Between PMT entries in main memory and the data cache&lt;br /&gt;
&lt;br /&gt;
10. How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) Via a shared memory semaphore&lt;br /&gt;
b) Via the interrupt IDT&lt;br /&gt;
c) Via a hardware instruction&lt;br /&gt;
d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60588</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60588"/>
		<updated>2012-03-26T21:20:34Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessing architectures and its hardware based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that maintains the virtual to physical mappings of the most recently used pages. The CPU uses the TLB for fast lookup of the physical page frame number for a virtual memory location. The same lookup from the page table for the process (a software based construct) is slower. A page may become invalid due to a swap out to disk or may change protection level (read-only versus read-write). Maintaining coherence between TLB entries on different processors for the same page (as its state changes) is called the TLB coherence problem. This coherence is most often maintained by software based solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified cache-TLB coherence protocol (that is hardware based).&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (one single space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For a process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page table whose entries are 4 bytes (32 bits) long can point to 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), then a system with 4-byte entries can address 2^32 × 2^12 = 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (located through the PTBR, which requires one memory access), find the frame number there and only then access the desired location in that frame. Memory access is thus slowed down by a factor of 2.&lt;br /&gt;
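&lt;br /&gt;
As a small worked example of the split into page number and offset, the following Python sketch translates one address through a single-level page table. The 4 KB page size matches the example above; the table contents and the translate function are assumptions for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Illustrative single-level translation; the sizes and table are assumptions.

PAGE_SIZE = 4096                          # 4 KB pages


def translate(address, page_table):
    """Split an address into (p, d) and map it through the page table."""
    page_number, offset = divmod(address, PAGE_SIZE)
    frame_base = page_table[page_number] * PAGE_SIZE
    return frame_base + offset


page_table = {0: 5, 1: 2}                 # page number -> frame number
physical = translate(4100, page_table)    # page 1, offset 4
```
&lt;br /&gt;
Address 4100 lies in page 1 at offset 4; page 1 maps to frame 2, so the physical address is 2 × 4096 + 4 = 8196.&lt;br /&gt;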
&lt;br /&gt;
[[File:TLB.png‎|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
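&lt;br /&gt;
The hit and miss paths just described can be modelled in a few lines of Python. This is a sketch only: the Tlb class is hypothetical, LRU is chosen as one of the replacement policies mentioned above, and a real TLB compares all tags in parallel in hardware.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class Tlb:
    """Small fully associative TLB with LRU replacement (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # page number -> frame number
        self.hits = 0
        self.misses = 0

    def lookup(self, page, page_table):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)        # mark most recently used
            return self.entries[page]
        self.misses += 1                          # TLB miss: read page table
        frame = page_table[page]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)      # evict least recently used
        self.entries[page] = frame
        return frame


page_table = {0: 8, 1: 9, 2: 10}
tlb = Tlb(capacity=2)
frames = [tlb.lookup(p, page_table) for p in (0, 1, 0, 2, 1)]
```
&lt;br /&gt;
With a capacity of two entries, the reference string 0, 1, 0, 2, 1 produces one hit (the second reference to page 0) and four misses, because page 1 is evicted when page 2 is loaded.&lt;br /&gt;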
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first level page table points to the entry in the next level page table and so on until the mapping of virtual to physical page is obtained from the last level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed Physically Tagged&amp;quot; solution that enables simultaneous look-up of both the L-1 cache and the TLB. If the block is not found in the L-1 cache, the address obtained from the TLB look-up is used to fetch it. Simultaneous look-up of both the L-1 cache and the TLB makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that falls within an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection level changes for a TLB mapping need to be observed by all processors. Finally, a mapping modification for a TLB entry (where the physical mapping changes for a virtual address) needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not maintain TLB coherence for every datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges and e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
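&lt;br /&gt;
The distinction between safe and unsafe changes can be expressed as a small predicate. The sketch below is illustrative: the Pte fields, the two permission levels and the is_unsafe function are assumptions, not part of any particular architecture.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical classifier for changes to a translation; the Pte fields and
# the two-level permission encoding are assumptions for illustration.
from dataclasses import dataclass

PERMISSION_RANK = {"read-only": 0, "read-write": 1}


@dataclass(frozen=True)
class Pte:
    frame: int
    permission: str
    valid: bool = True
    dirty: bool = False


def is_unsafe(old, new):
    """Return True if other TLBs must be made coherent for this change."""
    if old.frame != new.frame:
        return True                       # a) mapping modified
    if old.valid and not new.valid:
        return True                       # c) translation marked invalid
    # b) privileges decreased; increases and dirty-bit updates are safe
    return PERMISSION_RANK[old.permission] > PERMISSION_RANK[new.permission]


old = Pte(frame=4, permission="read-write")
downgrade = is_unsafe(old, Pte(frame=4, permission="read-only"))
upgrade = is_unsafe(Pte(4, "read-only"), Pte(4, "read-write"))
dirty_update = is_unsafe(old, Pte(4, "read-write", dirty=True))
remap = is_unsafe(old, Pte(5, "read-write"))
```
&lt;br /&gt;
Only the downgrade and the remap require TLB coherence actions; the upgrade and the dirty-bit update can be picked up lazily as described above.&lt;br /&gt;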
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
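&lt;br /&gt;
The safe/unsafe distinction described above can be sketched as a small classifier. This is an illustrative Python sketch rather than code from any kernel: the PmtEntry fields and the is_safe_change helper are hypothetical names chosen for this example, and protection is modelled as only two levels (read-only and read-write).&lt;br /&gt;
&lt;br /&gt;
```python
from dataclasses import dataclass


# Hypothetical, simplified PMT entry; the field names are assumptions for illustration.
@dataclass(frozen=True)
class PmtEntry:
    frame: int        # physical frame number
    writable: bool    # False = read-only, True = read-write
    present: bool     # page resident in main memory
    referenced: bool  # page reference history (a single bit here)


def is_safe_change(old, new):
    """Return True if stale TLB copies of `old` can be tolerated once the PMT holds `new`."""
    if old.present and new.frame != old.frame:
        return False  # mapping changed: a stale entry would reach a reassigned frame
    if old.present and not new.present:
        return False  # page no longer resident: a stale entry must not be used
    if old.writable and not new.writable:
        return False  # access rights restricted: a stale entry would still allow writes
    # Granting write access, setting the presence bit, or updating the
    # reference history is either caught later by a fault or needs no action.
    return True


old = PmtEntry(frame=7, writable=False, present=True, referenced=False)
print(is_safe_change(old, PmtEntry(7, True, True, False)))   # read-only to read-write
print(is_safe_change(old, PmtEntry(9, False, True, False)))  # frame remapped
```
&lt;br /&gt;
The first call reports a safe change and the second an unsafe one, matching the two cases discussed in the text.&lt;br /&gt;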
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants differ in several respects: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
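&lt;br /&gt;
The six steps above can be sketched as a small single-threaded simulation. This is a minimal model under stated assumptions, not an operating system implementation: the Processor class, the handle_ipi and resume methods and the shootdown function are hypothetical names, inter-processor interrupts are modelled as direct method calls, and victims refill their TLBs lazily on the next access.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


class Processor:
    """Hypothetical model of one CPU: a private TLB plus the flags used in the steps above."""

    def __init__(self, name, page_table):
        self.name = name
        self.page_table = page_table
        self.tlb = {}                  # virtual page to frame
        self.tlb_active = True
        self.interrupts_enabled = True

    def load(self, vpage):
        # Refill the TLB from the page table if the entry is missing.
        if vpage not in self.tlb:
            self.tlb[vpage] = self.page_table[vpage]
        return self.tlb[vpage]

    def handle_ipi(self, vpage):
        # Step 4: mask interrupts, invalidate the entry, mark the TLB inactive, then wait.
        self.interrupts_enabled = False
        self.tlb.pop(vpage, None)
        self.tlb_active = False

    def resume(self):
        # Step 6: leave the wait state and start using the TLB again.
        self.tlb_active = True
        self.interrupts_enabled = True


def shootdown(initiator, victims, lock, vpage, new_frame):
    initiator.interrupts_enabled = False          # step 1
    initiator.tlb_active = False
    with lock:                                    # step 2: lock the page table
        for cpu in victims:                       # step 3: one simulated IPI per victim
            cpu.handle_ipi(vpage)
        # Every victim has responded before the PMT is touched.
        assert not any(cpu.tlb_active for cpu in victims)
        initiator.page_table[vpage] = new_frame   # step 5: update the PMT and own TLB
        initiator.tlb[vpage] = new_frame
        initiator.tlb_active = True
    initiator.interrupts_enabled = True
    for cpu in victims:                           # step 6
        cpu.resume()


page_table = {0x10: 3}
cpus = [Processor('cpu%d' % i, page_table) for i in range(3)]
for cpu in cpus:
    cpu.load(0x10)
shootdown(cpus[0], cpus[1:], threading.Lock(), 0x10, new_frame=8)
print([cpu.load(0x10) for cpu in cpus])
```
&lt;br /&gt;
After the shootdown every processor resolves the page to the new frame, because the stale entries were removed before the PMT was changed.&lt;br /&gt;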
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an address space identifier (ASID). The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, forcing a subsequent TLB invalidation.&lt;br /&gt;
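&lt;br /&gt;
The ASID mechanism described above can be sketched as follows. This is a simplified Python model, not MIPS or IRIX code: the Cpu class, its asid_of array and the modify_pte helper are hypothetical names, ASIDs are handed out from an unbounded counter (real ASID fields are small and wrap around), and the local TLB invalidation on the modifying processor is an assumption of this sketch.&lt;br /&gt;
&lt;br /&gt;
```python
import itertools


class Cpu:
    """Hypothetical per-processor state: an ASID-tagged TLB and the pid to ASID array."""

    def __init__(self):
        self.tlb = {}                          # (asid, virtual page) to frame
        self.asid_of = {}                      # pid to ASID on this processor (0 = none)
        self._next_asid = itertools.count(1)   # 0 is reserved as the stale marker

    def current_asid(self, pid):
        # A zero (or missing) ASID means the process needs a fresh one on this processor.
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = next(self._next_asid)
        return self.asid_of[pid]

    def translate(self, pid, vpage, page_table):
        key = (self.current_asid(pid), vpage)
        if key not in self.tlb:                # miss: stale entries carry an old ASID
            self.tlb[key] = page_table[vpage]
        return self.tlb[key]


def modify_pte(pid, vpage, frame, page_table, local_cpu, other_cpus):
    page_table[vpage] = frame
    # Assumed for this sketch: the modifying processor drops its own entry directly.
    local_cpu.tlb.pop((local_cpu.current_asid(pid), vpage), None)
    for cpu in other_cpus:
        cpu.asid_of[pid] = 0                   # only the array changes, not the TLB


page_table = {4: 100}
cpu0, cpu1 = Cpu(), Cpu()
print(cpu0.translate(1, 4, page_table), cpu1.translate(1, 4, page_table))
modify_pte(1, 4, 200, page_table, cpu0, [cpu1])
print(cpu0.translate(1, 4, page_table), cpu1.translate(1, 4, page_table))
```
&lt;br /&gt;
Note that the stale entry tagged with the old ASID remains in the second processor's TLB; it is simply never matched again.&lt;br /&gt;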
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are also used on AMD64 processors, for example by the Xen hypervisor with AMD-V (SVM). &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
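&lt;br /&gt;
The generation-count check can be sketched in a few lines. This is a minimal Python model under stated assumptions: the Memory and Tlb classes and their methods are hypothetical names, one generation count is kept per frame, and a denied request is modelled as an immediate reload of the TLB entry.&lt;br /&gt;
&lt;br /&gt;
```python
class Memory:
    """Hypothetical memory manager keeping one generation count per frame."""

    def __init__(self):
        self.page_table = {}    # virtual page to frame
        self.generation = {}    # frame to generation count

    def remap(self, vpage, frame):
        # Changing a mapping bumps the generation count of the frame involved.
        self.page_table[vpage] = frame
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def access(self, frame, generation):
        # The request is honoured only if the caller's generation count is current.
        return self.generation[frame] == generation


class Tlb:
    def __init__(self, memory):
        self.memory = memory
        self.entries = {}       # virtual page to (frame, generation when cached)
        self.refills = 0

    def read(self, vpage):
        if vpage in self.entries and self.memory.access(*self.entries[vpage]):
            return self.entries[vpage][0]
        # Miss or stale generation: reload the entry from the page table.
        frame = self.memory.page_table[vpage]
        self.entries[vpage] = (frame, self.memory.generation[frame])
        self.refills += 1
        return frame


mem = Memory()
mem.remap(vpage=1, frame=5)
tlb = Tlb(mem)
tlb.read(1)                     # first use fills the TLB
mem.remap(vpage=2, frame=5)     # frame 5 is reassigned to another page
mem.remap(vpage=1, frame=6)
print(tlb.read(1), tlb.refills)
```
&lt;br /&gt;
No interrupt is sent when the mapping changes; the stale entry is detected only when the processor next uses it.&lt;br /&gt;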
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy.&amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt; It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into an existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. This protocol improves on the software-based shootdown approach to TLB coherence and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from Figure 3, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The effective latency is higher than shown in the graphs because TLB invalidations result in extra cycles spent repopulating the TLB, from either the cache or main memory depending on the case. The overhead from TLB shootdowns is higher for applications that invoke the shootdown routine more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual to physical address mapping are never modified in the TLBs themselves but rather in the page table entries). There are only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid and thus subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
To integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD adds to each TLB a PTE content-addressable memory (PCAM) that holds the physical addresses of the PTEs whose translations are cached, so that coherence messages, which carry physical addresses, can be matched against TLB entries. UNITD also relaxes the correspondence of the translations to the memory block containing the PTE, rather than the PTE itself. Maintaining translation granularity at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, PCAM entries are referred to simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
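&lt;br /&gt;
The PCAM insertion and invalidation operations described above can be sketched as follows. This is an illustrative Python model, not the hardware design from the paper: the UnitdTlb class and its method names are hypothetical, and the 64-byte cache block and 8-byte PTE sizes are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
BLOCK_SIZE = 64   # assumed cache-block size in bytes
PTE_SIZE = 8      # assumed PTE size in bytes


class UnitdTlb:
    """Hypothetical TLB with a PCAM holding, per entry, the block address of its PTE."""

    def __init__(self, size):
        self.vpage = [None] * size
        self.frame = [None] * size
        self.valid = [False] * size    # Valid bit: True = Shared, False = Invalid
        self.pcam = [None] * size      # same index as the TLB entry it shadows

    def insert(self, index, vpage, frame, pte_addr):
        self.vpage[index], self.frame[index] = vpage, frame
        self.valid[index] = True
        self.pcam[index] = pte_addr // BLOCK_SIZE   # block granularity, not PTE

    def coherence_invalidate(self, phys_addr):
        # Clear the Valid bit of every TLB entry whose PTE lives in the written block.
        block = phys_addr // BLOCK_SIZE
        hits = [i for i, b in enumerate(self.pcam) if b == block and self.valid[i]]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits


tlb = UnitdTlb(size=4)
tlb.insert(0, vpage=0x40, frame=7, pte_addr=0x1000)
tlb.insert(1, vpage=0x41, frame=9, pte_addr=0x1000 + PTE_SIZE)   # same cache block
tlb.insert(2, vpage=0x90, frame=2, pte_addr=0x2000)
print(tlb.coherence_invalidate(0x1000), tlb.valid)
```
&lt;br /&gt;
A single invalidation for one block clears both translations whose PTEs share that block, which is the small performance cost of tracking coherence at cache-block granularity.&lt;br /&gt;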
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
Three results stand out. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to Figure 5 below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== Definitions ==&lt;br /&gt;
&lt;br /&gt;
PMT (Page Mapping Table) - This is an in-memory data structure that primarily helps the operating system map two address spaces to one another: virtual memory and physical memory. In addition to this mapping information, the PMT stores data fields that keep track of whether the virtual memory has been loaded into physical memory, its permissions and utilization metrics. Coupled with the TLB, the PMT is a critical component of virtual memory systems. Learn more about page tables [http://en.wikipedia.org/wiki/Page_table here].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol10, Issue-2, 1990, Pages 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence only&lt;br /&gt;
b) A protocol for cache coherence only&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following approaches provides a solution for TLB coherence (choose at least one)&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
6. What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) TLB active flag&lt;br /&gt;
b) Hardware instructions&lt;br /&gt;
c) Inter-Processor Interrupts&lt;br /&gt;
d) PMT entries in main memory&lt;br /&gt;
&lt;br /&gt;
7. Changes to which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
&lt;br /&gt;
a) Increase in protection level&lt;br /&gt;
b) Changes to virtual-physical address mappings&lt;br /&gt;
c) Page table reference counts&lt;br /&gt;
d) Presence bit&lt;br /&gt;
&lt;br /&gt;
8. An ASID refers to: (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) The unique ID of a process for its lifetime&lt;br /&gt;
b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
c) The unique ID of a PMT entry&lt;br /&gt;
d) The unique ID of a page&lt;br /&gt;
&lt;br /&gt;
9. If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
b) Between PMT entries in main memory and processor registers&lt;br /&gt;
c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
d) Between PMT entries in main memory and the data cache&lt;br /&gt;
&lt;br /&gt;
10. How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) Via a shared memory semaphore&lt;br /&gt;
b) Via the interrupt IDC&lt;br /&gt;
c) Via a hardware instruction&lt;br /&gt;
d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60096</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60096"/>
		<updated>2012-03-20T00:30:44Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Quiz */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessing architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that maintains the virtual-to-physical mappings of the most recently used pages. The CPU uses the TLB for fast lookup of the physical page frame number for a virtual memory location. The same lookup from the process's page table (a software-based construct) is slower. A page may become invalid when it is swapped out of memory, or its protection level may change (read-only versus read-write). Maintaining coherence between TLB entries on different processors for the same page (as its state changes) is called the TLB coherence problem. This coherence is most often maintained by software-based solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified cache-TLB coherence protocol (that is hardware based).&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (one single space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to the page table is stored in a register (the '''Page-Table Base Register, PTBR'''). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
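&lt;br /&gt;
The arithmetic behind the 16 TB figure can be checked with a few lines of Python. The constant names are chosen for this example, and the calculation assumes, as the text does, that all 32 bits of a 4-byte entry are available for the frame number.&lt;br /&gt;
&lt;br /&gt;
```python
ENTRY_BITS = 32            # a 4-byte page-table entry
FRAME_SIZE = 4 * 1024      # 4 KB frames, i.e. 2**12 bytes

frames = 2 ** ENTRY_BITS                 # distinct frame numbers an entry can hold
physical_bytes = frames * FRAME_SIZE     # 2**32 * 2**12 = 2**44
print(physical_bytes == 2 ** 44, physical_bytes // 2 ** 40, 'TB')
```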
&lt;br /&gt;
Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the PTBR), find the frame number from the page table, and only then access the required location in that frame. Memory access is thus slowed down by a factor of 2.&lt;br /&gt;
&lt;br /&gt;
[[File:TLB.png‎|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found its frame number is immediately available and is used to access memory. &lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
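&lt;br /&gt;
The hit, miss and replacement behaviour described above can be sketched with a small software model. This is an illustration only, assuming an LRU replacement policy; the TLB class and its fields are hypothetical and do not correspond to any real hardware interface.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict

class TLB:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()    # page number -> frame number, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, page, page_table):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)       # mark as most recently used
            return self.entries[page]
        self.misses += 1
        frame = page_table[page]                 # TLB miss: consult the page table
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)     # evict the least recently used entry
        self.entries[page] = frame
        return frame

page_table = {0: 7, 1: 3, 2: 8}                  # hypothetical page -> frame mapping
tlb = TLB(capacity=2)
for page in [0, 1, 0, 2, 1]:
    tlb.lookup(page, page_table)
print(tlb.hits, tlb.misses)                      # 1 4
```
&lt;br /&gt;
With a capacity of two entries, only the second reference to page 0 hits; the later reference to page 1 misses because page 1 was evicted to make room for page 2.&lt;br /&gt;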
&lt;br /&gt;
A page-table entry typically has an additional bit to indicate whether the page is read-only or read-write (the '''protection level'''). Another bit, the '''valid-invalid''' bit, indicates whether the page is part of that particular process's address space.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. An entry in the first-level page table points to a next-level page table, and so on, until the virtual-to-physical page mapping is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that L1 caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; (VIPT) design that allows the L1 cache and the TLB to be looked up simultaneously. The virtual address indexes the cache while the TLB translates the page number in parallel, and the resulting physical address is compared against the cache tags. On an L1 miss, the translated physical address is used to fetch the block from the next level of the memory hierarchy. Overlapping the two lookups makes the access more efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be cached simultaneously in multiple TLBs. (This happens because data is shared across processors and because processes migrate from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes give rise to the TLB coherence problem. A swap-out means the corresponding virtual-to-physical page mapping is no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), that mapping must be flushed from all TLBs and must not be used. Likewise, a protection-level change for a cached mapping must be observed by all processors, and a mapping modification (where the physical page mapped to a virtual address changes) must be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for every datum (i.e., TLBs may temporarily hold different values). These architectures require TLB coherence only for unsafe changes to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be updated lazily when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
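&lt;br /&gt;
The lazy update in the read-only to read-write example can be sketched as below. This is a toy model rather than an account of any particular kernel: permissions are plain strings, and the store function stands in for both the faulting store and a hypothetical fault handler.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch only: permissions are modelled as strings such as "r" or "rw".
def store(core_tlb, page_table, page):
    """Attempt a store; return how many faults were taken along the way."""
    faults = 0
    if "w" not in core_tlb[page]:
        # Access violation: the (hypothetical) fault handler reloads the
        # translation from the page table instead of relying on a shootdown.
        faults += 1
        core_tlb[page] = page_table[page]
        if "w" not in core_tlb[page]:
            raise PermissionError("page is genuinely read-only")
    return faults

page_table = {7: "r"}
core0_tlb = {7: "r"}           # core 0 cached the read-only translation
page_table[7] = "rw"           # another core raises the privileges (a safe change)

first = store(core0_tlb, page_table, 7)     # faults once, then picks up "rw"
second = store(core0_tlb, page_table, 7)    # no fault: the TLB entry is now current
print(first, second)                        # 1 0
```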
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually indexed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it is not discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants differ in several respects: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
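&lt;br /&gt;
The six steps above can be traced with a small sequential simulation. It is a sketch under strong assumptions: there is no real concurrency, locking or interrupt delivery, the Processor class and the shootdown function are hypothetical, and victims simply reload the translation on their next TLB miss instead of updating eagerly in step 6.&lt;br /&gt;
&lt;br /&gt;
```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.tlb = {}              # page -> frame
        self.tlb_active = True

def shootdown(initiator, victims, page_table, page, new_frame):
    log = []
    initiator.tlb_active = False               # step 1: initiator stops using its TLB
    log.append("lock")                         # step 2: lock the page table
    for cpu in victims:                        # steps 3-4: one IPI per victim
        cpu.tlb.pop(page, None)                # handler invalidates the stale entry
        cpu.tlb_active = False                 # victim then waits on the lock
        log.append("ipi:" + cpu.name)
    page_table[page] = new_frame               # step 5: update the PMT and own TLB
    initiator.tlb[page] = new_frame
    initiator.tlb_active = True
    log.append("unlock")
    for cpu in victims:                        # step 6: victims leave the wait state
        cpu.tlb_active = True
    return log

page_table = {4: 10}
cpu0, cpu1, cpu2 = Processor("cpu0"), Processor("cpu1"), Processor("cpu2")
for cpu in (cpu0, cpu1, cpu2):
    cpu.tlb[4] = 10                            # every TLB caches the old mapping

log = shootdown(cpu0, [cpu1, cpu2], page_table, page=4, new_frame=11)
print(log)                                     # ['lock', 'ipi:cpu1', 'ipi:cpu2', 'unlock']
```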
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an address space identifier (ASID). The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’s ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to its ASID on each processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process has a zero ASID value on a processor, it assigns the process a new ASID. The new ASID will not match the stale TLB entries (which carry the old ASID) on that processor, so those entries are effectively invalidated and translations are reloaded on the next access.&lt;br /&gt;
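&lt;br /&gt;
The ASID mechanism can be sketched as follows. This is a simplified model, not IRIX or MIPS code: the Cpu class, its asid_of array and the pte_changed helper are hypothetical names, and ASID exhaustion and recycling are ignored.&lt;br /&gt;
&lt;br /&gt;
```python
class Cpu:
    def __init__(self):
        self.next_asid = 1
        self.asid_of = {}      # pid -> ASID on this CPU (0 means a new one is needed)
        self.tlb = {}          # (asid, page) -> frame

    def asid_for(self, pid):
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = self.next_asid   # hand out a fresh ASID
            self.next_asid += 1
        return self.asid_of[pid]

    def lookup(self, pid, page, page_table):
        key = (self.asid_for(pid), page)
        if key not in self.tlb:                  # miss: stale entries never match
            self.tlb[key] = page_table[page]
        return self.tlb[key]

def pte_changed(pid, other_cpus):
    for cpu in other_cpus:
        cpu.asid_of[pid] = 0   # only the pid -> ASID array changes, not the TLB

remote = Cpu()
page_table = {0: 5}
before = remote.lookup(42, 0, page_table)    # cached under ASID 1
page_table[0] = 6                            # the PTE is modified elsewhere
pte_changed(42, [remote])
after = remote.lookup(42, 0, page_table)     # new ASID 2 forces a reload
print(before, after)                         # 5 6
```
&lt;br /&gt;
Note that the stale entry tagged with the old ASID is still physically present in the modelled TLB; it is simply never matched again.&lt;br /&gt;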
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are also used on AMD64 processors, for example by the Xen hypervisor with AMD-V (SVM). &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
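&lt;br /&gt;
A minimal sketch of the generation-count check is shown below. It assumes one counter per frame that is incremented whenever the frame's mapping changes; the MemoryManager class and the access helper are hypothetical, and permission changes are left out for brevity.&lt;br /&gt;
&lt;br /&gt;
```python
class MemoryManager:
    def __init__(self):
        self.page_table = {}     # page -> frame
        self.generation = {}     # frame -> generation count

    def _bump(self, frame):
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def map_page(self, page, frame):
        old = self.page_table.get(page)
        if old is not None:
            self._bump(old)      # the old frame's mapping has changed
        self.page_table[page] = frame
        self._bump(frame)

    def translation(self, page):
        frame = self.page_table[page]
        return frame, self.generation[frame]

    def request(self, frame, generation):
        # Memory accepts the request only if the generation count is current.
        return self.generation[frame] == generation

def access(tlb, mm, page):
    if page not in tlb:
        tlb[page] = mm.translation(page)
    denied = 0
    while not mm.request(*tlb[page]):
        denied += 1
        tlb[page] = mm.translation(page)   # processor is told to refresh its entry
    return tlb[page][0], denied

mm = MemoryManager()
tlb = {}
mm.map_page(0, 3)
print(access(tlb, mm, 0))    # (3, 0): fresh entry, nothing denied
mm.map_page(0, 4)            # remap page 0; no interrupt is sent to the processor
print(access(tlb, mm, 0))    # (4, 1): the stale entry is denied once, then refreshed
```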
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy.&amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt; It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into an existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. It improves on the software-based shootdown approach to TLB coherence and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software-based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly affect user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from Figure 3, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The effective latency is higher than shown in the graphs, because TLB invalidations cause extra cycles to be spent repopulating the TLB, either from the cache or from main memory. The overhead from TLB shootdowns is higher for applications that trigger them more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual to physical address mapping are never modified in the TLBs themselves but rather in the page table entries). There are only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid and thus subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
To integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD pairs each TLB with a PCAM, a content-addressable memory that holds the physical addresses of the PTEs whose translations are cached in that TLB. UNITD also relaxes the correspondence between a translation and its PTE: each translation is tied to the memory block containing the PTE rather than to the PTE itself. Maintaining coherence at this coarser granularity (i.e., cache block rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can reside in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, we refer to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
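&lt;br /&gt;
The PCAM insertion and invalidation operations can be modelled in a few lines. This is a software sketch of the idea only, not the hardware design from the paper; the UnitdTlb class, its field names and the example addresses are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
class UnitdTlb:
    def __init__(self, size):
        self.valid = [False] * size          # Valid bit: Shared (True) or Invalid (False)
        self.translation = [None] * size     # (virtual page, frame)
        self.pcam = [None] * size            # block address holding the PTE

    def insert(self, index, vpage, frame, pte_block_address):
        # The PTE address goes into the PCAM at the same index as the translation.
        self.translation[index] = (vpage, frame)
        self.pcam[index] = pte_block_address
        self.valid[index] = True

    def lookup(self, vpage):
        # Regular lookups never touch the PCAM.
        for i, entry in enumerate(self.translation):
            if self.valid[i] and entry[0] == vpage:
                return entry[1]
        return None                          # TLB miss

    def coherence_invalidate(self, block_address):
        hits = [i for i, addr in enumerate(self.pcam)
                if self.valid[i] and addr == block_address]
        for i in hits:
            self.valid[i] = False            # clear the Valid bit of the TLB entry
            self.pcam[i] = None              # remove the PCAM entry
        return hits

tlb = UnitdTlb(4)
tlb.insert(0, vpage=100, frame=5, pte_block_address=12)
tlb.insert(1, vpage=101, frame=6, pte_block_address=47)
tlb.insert(2, vpage=102, frame=7, pte_block_address=12)   # same block as entry 0

hits = tlb.coherence_invalidate(12)
print(hits)                                  # [0, 2]
print(tlb.lookup(100), tlb.lookup(101))      # None 6
```
&lt;br /&gt;
As in Figure 2(b), a single invalidation of block address 12 hits two PCAM entries, so both translations miss on their next use while the unrelated entry stays valid.&lt;br /&gt;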
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. This validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
Three observations follow. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graph below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Milenkovic, Milan. Microprocessor Memory Management Units, IEEE Micro, Vol. 10, Issue 2, 1990, pp. 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence only&lt;br /&gt;
b) A protocol for cache coherence only&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following approaches provides a solution for TLB coherence (choose at least one)&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
6. What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) TLB active flag&lt;br /&gt;
b) Hardware instructions&lt;br /&gt;
c) Inter-Processor Interrupts&lt;br /&gt;
d) PMT entries in main memory&lt;br /&gt;
&lt;br /&gt;
7. Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
&lt;br /&gt;
a) Increase in protection level&lt;br /&gt;
b) Changes to virtual-physical address mappings&lt;br /&gt;
c) Page table reference counts&lt;br /&gt;
d) Presence bit&lt;br /&gt;
&lt;br /&gt;
8. An ASID refers to: (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) The unique ID of a process for its lifetime&lt;br /&gt;
b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
c) The unique ID of a PMT entry&lt;br /&gt;
d) The unique ID of a page&lt;br /&gt;
&lt;br /&gt;
9. If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
b) Between PMT entries in main memory and processor registers&lt;br /&gt;
c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
d) Between PMT entries in main memory and the data cache&lt;br /&gt;
&lt;br /&gt;
10. How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
&lt;br /&gt;
a) Via a shared memory semaphore&lt;br /&gt;
b) Via the interrupt IDC&lt;br /&gt;
c) Via a hardware instruction&lt;br /&gt;
d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60095</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60095"/>
		<updated>2012-03-20T00:27:30Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Address Space Identifier (ASID) based Approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessing architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that maintains the virtual-to-physical mappings of the most recently used pages. The CPU uses the TLB for fast lookup of the physical page frame number for a virtual memory location. The same lookup from the process's page table (a software-based construct) is slower. A page may become invalid because it is swapped out of memory, or its protection level may change (read-only versus read-write). Keeping the TLB entries on different processors consistent for the same page as its state changes is called the TLB coherence problem. This coherence is most often maintained by software-based solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified, hardware-based cache-TLB coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (one single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, relying on the hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to the page table is stored in a register, the '''Page-Table Base Register (PTBR)'''. Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page table whose entries are 4 bytes (32 bits) long can point to 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), then a system with 4-byte entries can address 2^32 x 2^12 = 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (using the value in the PTBR offset by the page number, which requires a memory access), obtain the frame number from the page table, and only then access the desired location in that frame. Two memory accesses are therefore needed for each reference, so memory access is slowed down by a factor of 2.&lt;br /&gt;
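&lt;br /&gt;
The address split and the 16 TB figure above can be checked with a short sketch. It assumes 4 KB pages and 4-byte page-table entries as in the text; the function name translate and the sample page table are hypothetical and are used only for illustration.&lt;br /&gt;

```python
PAGE_SIZE = 4 * 1024   # 4 KB pages and frames, as in the example above
ENTRY_BITS = 32        # a 4-byte page-table entry holds a 32-bit frame number


def translate(virtual_address, page_table):
    """Split the address into (p, d) and rebuild the physical address."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    frame_number = page_table[page_number]   # the extra memory access
    return frame_number * PAGE_SIZE + offset


# Hypothetical page table: virtual page 0 maps to frame 5, page 1 to frame 2.
page_table = {0: 5, 1: 2}
physical = translate(1 * PAGE_SIZE + 100, page_table)

# Addressable physical memory: 2^32 frames of 2^12 bytes each = 2^44 bytes.
addressable_bytes = (2 ** ENTRY_BITS) * PAGE_SIZE
```

Here the offset (100) is carried over unchanged, and only the page number is replaced by a frame number.&lt;br /&gt;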
&lt;br /&gt;
[[File:TLB.png|400px|thumb|right|Virtual to physical page mapping using TLB and page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found its frame number is immediately available and is used to access memory. &lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
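&lt;br /&gt;
The hit and miss paths described above can be sketched as a small software model. This is only an illustration under stated assumptions: the class name TLB, its capacity, and the sample page table are hypothetical, and LRU is chosen as one of the replacement policies mentioned in the text.&lt;br /&gt;

```python
from collections import OrderedDict


class TLB:
    """A tiny fully associative TLB with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # page number to frame number
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number, page_table):
        if page_number in self.entries:
            self.hits += 1
            self.entries.move_to_end(page_number)   # mark as most recently used
            return self.entries[page_number]
        # TLB miss: consult the page table in memory, then cache the result.
        self.misses += 1
        frame_number = page_table[page_number]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)        # evict the LRU entry
        self.entries[page_number] = frame_number
        return frame_number


page_table = {0: 7, 1: 3, 2: 9}   # hypothetical mappings
tlb = TLB(capacity=2)
frames = [tlb.lookup(p, page_table) for p in (0, 1, 0, 2, 1)]
```

With a capacity of two entries, the reference to page 2 evicts page 1 (the least recently used), so the final reference to page 1 misses again.&lt;br /&gt;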
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the page is read-only or read-write (the '''protection level'''). Another bit, '''valid-invalid''', indicates whether the page belongs to the address space of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the virtual-to-physical page mapping is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; scheme that enables simultaneous lookup of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB lookup is used to fetch it. Performing the L1 cache and TLB lookups simultaneously makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be cached in multiple TLBs simultaneously. (This happens because data is shared across processors and because processes migrate from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), that mapping must be flushed from all TLBs and must not be used. Protection-level changes to a mapping must likewise be observed by all processors. Finally, a mapping modification (where the physical page for a virtual address changes) must be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges and e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be updated lazily: when the core attempts to execute a store instruction, an access violation occurs and the page fault handler loads the updated translation.&lt;br /&gt;
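&lt;br /&gt;
The lazy update in the read-only to read-write example can be sketched as follows. This is a simplified model, not how any particular kernel implements it: the function store, the dictionaries standing in for the TLB and page table, and the permission strings are all hypothetical.&lt;br /&gt;

```python
READ_ONLY, READ_WRITE = "r", "rw"

page_table = {4: READ_WRITE}   # another core already upgraded page 4
stale_tlb = {4: READ_ONLY}     # this core still caches the old permission


def store(page, tlb, page_table):
    """Attempt a store; on an access violation, reload the translation."""
    faults = 0
    if tlb.get(page) != READ_WRITE:
        faults += 1                    # access violation traps to the handler
        tlb[page] = page_table[page]   # handler loads the updated translation
    if tlb[page] != READ_WRITE:
        raise PermissionError("page %d is not writable" % page)
    return faults


first = store(4, stale_tlb, page_table)    # takes one fault, then succeeds
second = store(4, stale_tlb, page_table)   # no fault: the TLB is now current
```

No other core had to be interrupted: the stale entry only costs the first core one extra fault. The opposite change (read-write to read-only) would not trap, which is why it is classed as unsafe.&lt;br /&gt;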
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to coherence between PMTs and data caches than to TLB coherence, it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
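&lt;br /&gt;
The six steps above can be traced with a sequential sketch. It is a deliberately simplified model: the names Processor and shootdown are hypothetical, the inter-processor interrupt is modelled as a direct loop over the victims, the page-table lock is only recorded in a log, and victims are assumed to reload the invalidated entry on their next TLB miss.&lt;br /&gt;

```python
class Processor:
    def __init__(self, name, tlb):
        self.name = name
        self.tlb = dict(tlb)            # virtual page to frame
        self.tlb_active = True
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, page, new_frame, log):
    # Step 1: the initiator disables interrupts and clears its active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: lock the page table (modelled here as a log entry only).
    log.append("lock")
    # Steps 3-4: interrupt each victim; it invalidates the entry and waits.
    for cpu in victims:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(page, None)
        cpu.tlb_active = False
        log.append("ack " + cpu.name)
    # Step 5: every victim has responded, so update the PMT and local TLB.
    page_table[page] = new_frame
    initiator.tlb[page] = new_frame
    initiator.tlb_active = True
    initiator.interrupts_enabled = True
    log.append("unlock")
    # Step 6: victims leave the wait state and resume using their TLBs.
    for cpu in victims:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True


page_table = {8: 40}
cpu0 = Processor("cpu0", {8: 40})
cpu1 = Processor("cpu1", {8: 40})
cpu2 = Processor("cpu2", {})
log = []
shootdown(cpu0, [cpu1, cpu2], page_table, page=8, new_frame=41, log=log)
```

Note that cpu2 is interrupted even though it never cached page 8; this is the kind of false positive that a conservative victim list produces.&lt;br /&gt;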
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an address space identifier (ASID). The ASID is similar to a process identifier (PID), except that it is not guaranteed to stay the same for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’ ASIDs to 0 on the other processors. Note: this does not modify the ASIDs in the TLB entries, only the array that maps a process to its ASID on each processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns the process a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so those entries are effectively invalidated for the process.&lt;br /&gt;
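&lt;br /&gt;
A minimal sketch of this ASID scheme follows. The names Cpu and modify_pte are hypothetical, and the model ignores details such as the limited number of hardware ASIDs and ASID wraparound.&lt;br /&gt;

```python
class Cpu:
    def __init__(self):
        self.next_asid = 1
        self.asid_of = {}   # pid to ASID on this processor (0 means reassign)
        self.tlb = {}       # (asid, virtual page) to frame

    def current_asid(self, pid):
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = self.next_asid   # fresh ASID, no old matches
            self.next_asid += 1
        return self.asid_of[pid]

    def lookup(self, pid, page, page_table):
        key = (self.current_asid(pid), page)
        if key not in self.tlb:
            self.tlb[key] = page_table[page]     # miss: refill from page table
        return self.tlb[key]


def modify_pte(page_table, page, frame, pid, other_cpus):
    page_table[page] = frame
    for cpu in other_cpus:
        cpu.asid_of[pid] = 0   # only the pid-to-ASID array changes, not the TLB


page_table = {3: 10}
cpu1 = Cpu()
before = cpu1.lookup(pid=42, page=3, page_table=page_table)
modify_pte(page_table, page=3, frame=11, pid=42, other_cpus=[cpu1])
after = cpu1.lookup(pid=42, page=3, page_table=page_table)
```

The stale entry tagged with the old ASID is still physically present in the TLB after the update, but it can no longer match because the process now runs under a different ASID.&lt;br /&gt;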
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are also used to partition the TLB on AMD64 processors with AMD-V (SVM), as managed by the Xen hypervisor. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
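&lt;br /&gt;
The generation-count check can be sketched as below. The names Memory and load are hypothetical, and the sketch assumes the count is kept per frame and incremented whenever the frame is remapped.&lt;br /&gt;

```python
class Memory:
    def __init__(self):
        self.page_table = {}   # virtual page to frame
        self.generation = {}   # frame to generation count

    def map(self, page, frame):
        self.page_table[page] = frame
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def access(self, frame, generation):
        """Return True if the request is accepted, False if it is stale."""
        return self.generation.get(frame) == generation


def load(tlb, memory, page):
    """Access a page, refreshing the TLB entry if memory rejects it."""
    retries = 0
    while True:
        if page not in tlb:
            frame = memory.page_table[page]
            tlb[page] = (frame, memory.generation[frame])
        frame, generation = tlb[page]
        if memory.access(frame, generation):
            return frame, retries
        del tlb[page]   # stale generation: drop the entry and retry
        retries += 1


memory = Memory()
memory.map(page=1, frame=20)
tlb = {}
first = load(tlb, memory, 1)
memory.map(page=2, frame=20)   # frame 20 is reassigned to page 2
memory.map(page=1, frame=21)   # page 1 now lives in frame 21
second = load(tlb, memory, 1)
```

The second access presents the old generation count for frame 20, is rejected, and succeeds after one retry with the refreshed entry. No interrupt was sent when the mapping changed.&lt;br /&gt;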
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section the salient features of UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu/Lebeck/Sorin/Bracy in their paper &amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt;&lt;br /&gt;
are discussed. This provides an example of hardware based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into an existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach to TLB coherence and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are entirely hardware-based because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software-based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly affect user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from Figure 3, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The effective latency is higher than shown in the graphs, because TLB invalidations cost extra cycles to repopulate the TLB, from either the cache or main memory depending on the case. The overhead from TLB shootdowns is higher for applications that invoke the routine more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual-to-physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are therefore only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses that depend on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD ties each cached translation to the memory block that contains its PTE, rather than to the PTE itself. Each TLB is paired with a PCAM (PTE CAM), a content-addressable structure that holds the physical addresses of the PTEs for the translations cached in that TLB. Maintaining translation coherence at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same datum. For simplicity, we refer to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
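&lt;br /&gt;
The PCAM insertion and invalidation operations can be modelled in software as follows. This is an illustrative sketch only, since UNITD is a hardware mechanism: the class name UnitdTlb, its methods, and the sample addresses other than 12 are hypothetical.&lt;br /&gt;

```python
class UnitdTlb:
    """TLB with a parallel PCAM holding the block address of each PTE."""

    def __init__(self, size):
        self.valid = [False] * size
        self.translation = [None] * size   # (virtual page, frame)
        self.pcam = [None] * size          # PTE block address, same index

    def insert(self, index, virtual_page, frame, pte_block_address):
        self.translation[index] = (virtual_page, frame)
        self.pcam[index] = pte_block_address
        self.valid[index] = True           # Shared state

    def coherence_invalidate(self, block_address):
        """Clear the Valid bit of every entry whose PTE lies in the block."""
        hits = [i for i, addr in enumerate(self.pcam)
                if addr == block_address and self.valid[i]]
        for i in hits:
            self.valid[i] = False          # Invalid state
            self.pcam[i] = None            # the PCAM entry is removed
        return hits

    def lookup(self, virtual_page):
        for i, entry in enumerate(self.translation):
            if self.valid[i] and entry[0] == virtual_page:
                return entry[1]
        return None                        # miss: refetch from memory system


tlb = UnitdTlb(size=4)
tlb.insert(0, virtual_page=1, frame=50, pte_block_address=12)
tlb.insert(1, virtual_page=2, frame=51, pte_block_address=12)
tlb.insert(2, virtual_page=9, frame=52, pte_block_address=47)
hits = tlb.coherence_invalidate(12)
```

A single invalidation for block address 12 hits two PCAM entries, because both PTEs reside in the same cache block, while the translation whose PTE lies in another block is untouched.&lt;br /&gt;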
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
Three observations follow. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graph below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE Micro, Vol. 10, Issue 2, 1990, pp. 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following approaches provides a solution for TLB coherence (choose at least one)-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
6. What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
a) TLB active flag&lt;br /&gt;
b) Hardware instructions&lt;br /&gt;
c) Inter-Processor Interrupts&lt;br /&gt;
d) PMT entries in main memory&lt;br /&gt;
&lt;br /&gt;
7. Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
a) Increase in protection level&lt;br /&gt;
b) Changes to virtual-physical address mappings&lt;br /&gt;
c) Page table reference counts&lt;br /&gt;
d) Presence bit&lt;br /&gt;
&lt;br /&gt;
8. An ASID refers to: (choose one answer)&lt;br /&gt;
a) The unique ID of a process for its lifetime&lt;br /&gt;
b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
c) The unique ID of a PMT entry&lt;br /&gt;
d) The unique ID of a page&lt;br /&gt;
&lt;br /&gt;
9. If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
b) Between PMT entries in main memory and processor registers&lt;br /&gt;
c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
d) Between PMT entries in main memory and the data cache&lt;br /&gt;
&lt;br /&gt;
10. How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
a) Via a shared memory semaphore&lt;br /&gt;
b) Via the interrupt IDC&lt;br /&gt;
c) Via a hardware instruction&lt;br /&gt;
d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60094</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60094"/>
		<updated>2012-03-20T00:26:57Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Instruction-based Invalidation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessor architectures, and the hardware-based protocols that solve it, have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a small, typically fully associative hardware cache that holds the virtual-to-physical mappings of the most recently used pages. The CPU uses the TLB for fast lookup of the physical page frame number for a virtual memory location. The same lookup from the process's page table (a software-based construct) is slower. A page may become invalid because it is swapped out to disk, or its protection level may change (read-only versus read-write). Keeping the TLB entries on different processors consistent for the same page as its state changes is called the TLB coherence problem. This coherence is most often maintained by software solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified, hardware-based cache-TLB coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct, managed by the operating system, that is kept in main memory. For a process, a pointer to the page table is stored in a register, the '''Page-Table Base Register (PTBR)'''. Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page table whose entries are 4 bytes (32 bits) long can point to 2^32 physical page frames. If the frame size is 4 KB, then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
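&lt;br /&gt;
The arithmetic in the parenthetical above can be checked directly. This is only a worked calculation; the constant names are chosen for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
ENTRY_BITS = 32            # a 4-byte page-table entry
FRAME_SIZE = 4 * 1024      # 4 KB frames, i.e. 2**12 bytes

frames = 2 ** ENTRY_BITS               # number of frames an entry can name
addressable = frames * FRAME_SIZE      # 2**32 * 2**12 = 2**44 bytes

print(addressable == 2 ** 44)          # prints True
print(addressable // 2 ** 40, "TB")    # prints 16 TB
```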
&lt;br /&gt;
Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (located via the PTBR), which requires one memory access to obtain the frame number, and then make a second memory access to the desired location in that frame. Memory access is thus slowed down by a factor of 2.&lt;br /&gt;
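&lt;br /&gt;
The translation described above can be sketched as follows. This is a minimal illustration rather than a model of a real MMU: the 4 KB page size, the dictionary-based page table and the names split_address and translate are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
PAGE_SIZE = 4096  # assumed 4 KB pages, so the offset is the low 12 bits

# Hypothetical single-level page table: page number -> frame number
page_table = {0: 5, 1: 9, 2: 3}


def split_address(virtual_address):
    """Return (page number p, page offset d) for a virtual address."""
    return divmod(virtual_address, PAGE_SIZE)


def translate(virtual_address):
    """Translate a virtual address using one page-table lookup."""
    page_number, offset = split_address(virtual_address)
    frame_number = page_table[page_number]   # the extra memory access
    return frame_number * PAGE_SIZE + offset


physical = translate(0x1ABC)   # page 1, offset 0xABC
print(hex(physical))           # prints 0x9abc
```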
&lt;br /&gt;
[[File:TLB.png‎|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
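&lt;br /&gt;
The hit, miss and replacement behaviour described above can be sketched with a toy TLB. The class name, the capacity of two entries and the choice of LRU replacement are assumptions for illustration; real TLBs perform the lookup in parallel hardware.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class TLB:
    """Toy fully associative TLB with LRU replacement (illustrative only)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table     # page number -> frame number
        self.entries = OrderedDict()     # least recently used entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number):
        if page_number in self.entries:
            self.hits += 1
            self.entries.move_to_end(page_number)    # mark as most recent
            return self.entries[page_number]
        self.misses += 1
        frame_number = self.page_table[page_number]  # reference the page table
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)         # evict the LRU entry
        self.entries[page_number] = frame_number
        return frame_number


tlb = TLB(capacity=2, page_table={0: 5, 1: 9, 2: 3})
for page in [0, 1, 0, 2, 1]:
    tlb.lookup(page)
print(tlb.hits, tlb.misses)   # prints 1 4
```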
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the page is read-only or read-write (the '''protection level'''). Another bit, '''valid-invalid''', indicates whether the page belongs to the address space of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of the virtual page to a physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that enables simultaneous lookup of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB lookup is used to fetch it from the next level of the memory hierarchy. Simultaneous lookup of both the L1 cache and the TLB makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be cached simultaneously in multiple TLBs. (This happens because of data sharing across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to become invalid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping whose Page Table Entry (PTE) has been invalidated, that mapping must be flushed from all TLBs and must not be used. Further, protection-level changes for a mapping need to be observed by all processors. Finally, a mapping modification (where the physical page mapped to a virtual address changes) needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
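&lt;br /&gt;
The safe/unsafe distinction drawn above can be expressed as a small predicate. This is a sketch under stated assumptions: the PTE tuple, its field names and the numeric permission encoding (1 for read-only, 2 for read-write) are invented for the example and do not come from any particular architecture.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import namedtuple

# Hypothetical PMT entry; permission: 1 = read-only, 2 = read-write
PTE = namedtuple("PTE", "frame present permission")


def is_unsafe_change(old, new):
    """Return True if stale TLB copies of `old` must be invalidated."""
    if old.present and not new.present:
        return True                  # the page left main memory
    if old.present and new.frame != old.frame:
        return True                  # the page-frame mapping changed
    if (old.permission - new.permission) == 1:
        return True                  # access rights were restricted
    return False                     # e.g. read-only became read-write


# Moving from read-only to read/write is safe; the reverse is not.
print(is_unsafe_change(PTE(7, True, 1), PTE(7, True, 2)))   # prints False
print(is_unsafe_change(PTE(7, True, 2), PTE(7, True, 1)))   # prints True
```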
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants depend on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
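&lt;br /&gt;
The six steps above can be walked through in a single-threaded toy model. This is a sketch only: the names Processor, PageTable and shootdown are hypothetical, the inter-processor interrupt is modelled as a direct method call, and re-enabling interrupts at the end is an assumption. A real kernel runs these steps concurrently on several processors, and in this model the victims reload the translation on their next TLB miss rather than updating it eagerly.&lt;br /&gt;
&lt;br /&gt;
```python
class PageTable:
    def __init__(self, entries):
        self.entries = dict(entries)   # page number -> frame number
        self.locked = False


class Processor:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}                  # page number -> frame number
        self.tlb_active = True         # the TLB active flag
        self.interrupts_enabled = True

    def handle_shootdown_ipi(self, page_number):
        """Step 4: invalidate the stale entry and stop using the TLB."""
        self.interrupts_enabled = False
        self.tlb.pop(page_number, None)
        self.tlb_active = False

    def resume(self):
        """Step 6: leave the wait state and use the TLB again."""
        self.tlb_active = True
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, page_number, new_frame):
    initiator.interrupts_enabled = False           # step 1
    initiator.tlb_active = False
    page_table.locked = True                       # step 2
    for cpu in victims:                            # steps 3 and 4
        cpu.handle_shootdown_ipi(page_number)
    page_table.entries[page_number] = new_frame    # step 5
    initiator.tlb[page_number] = new_frame
    initiator.tlb_active = True
    page_table.locked = False
    initiator.interrupts_enabled = True            # assumed, not in the steps
    for cpu in victims:                            # step 6
        cpu.resume()


table = PageTable({4: 10})
cpus = [Processor(i) for i in range(3)]
for cpu in cpus:
    cpu.tlb[4] = 10                                # every TLB caches page 4
shootdown(cpus[0], cpus[1:], table, page_number=4, new_frame=22)
print(cpus[0].tlb, cpus[1].tlb)                    # prints {4: 22} {}
```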
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an address space identifier (ASID). The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned its own ASID on each processor; hence, a single PID may correspond to many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’s ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so those entries can no longer be used and the translations must be reloaded. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used on AMD64 processors, for example by the Xen hypervisor with AMD-V. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
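&lt;br /&gt;
The ASID bookkeeping described above can be sketched as follows. The class AsidManager and its methods are hypothetical names for this illustration; a real kernel must also handle ASID exhaustion and wrap-around, which this sketch ignores.&lt;br /&gt;
&lt;br /&gt;
```python
class AsidManager:
    """Toy per-processor ASID bookkeeping (illustrative only)."""

    def __init__(self, num_cpus):
        # asid_of[cpu][pid] is the ASID; 0 means it must be reassigned
        self.asid_of = [dict() for _ in range(num_cpus)]
        self.next_asid = [1] * num_cpus

    def asid_for(self, cpu, pid):
        """Return the ASID of a process on a CPU, assigning a new one if it is 0."""
        if self.asid_of[cpu].get(pid, 0) == 0:
            self.asid_of[cpu][pid] = self.next_asid[cpu]
            self.next_asid[cpu] += 1
        return self.asid_of[cpu][pid]

    def pte_modified(self, pid, initiating_cpu):
        """Zero the ASID of the process on all other CPUs; TLBs are untouched."""
        for cpu, table in enumerate(self.asid_of):
            if cpu != initiating_cpu and pid in table:
                table[pid] = 0


mgr = AsidManager(num_cpus=2)
old = mgr.asid_for(cpu=1, pid=42)
tlb_cpu1 = {(old, 0x10): 7}            # (ASID, page number) -> frame number
mgr.pte_modified(pid=42, initiating_cpu=0)
new = mgr.asid_for(cpu=1, pid=42)
print((new, 0x10) in tlb_cpu1)         # prints False: the stale entry cannot match
```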
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
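&lt;br /&gt;
The generation-count check described above can be sketched as follows. The Memory class and its method names are assumptions made for this example; in the scheme itself the comparison is performed by the memory system on each access.&lt;br /&gt;
&lt;br /&gt;
```python
class Memory:
    """Toy memory manager keeping a generation count per frame (illustrative)."""

    def __init__(self):
        self.page_table = {}    # page number -> frame number
        self.generation = {}    # frame number -> generation count

    def map_page(self, page, frame):
        """Map or remap a page, bumping the count of every frame involved."""
        old_frame = self.page_table.get(page)
        for f in (old_frame, frame):
            if f is not None:
                self.generation[f] = self.generation.get(f, 0) + 1
        self.page_table[page] = frame

    def load_tlb_entry(self, page):
        """Build a TLB entry that records the current generation count."""
        frame = self.page_table[page]
        return (frame, self.generation[frame])

    def access(self, tlb_entry):
        """Accept the access only if the entry's generation count is current."""
        frame, generation = tlb_entry
        return self.generation[frame] == generation


mem = Memory()
mem.map_page(page=3, frame=8)
entry = mem.load_tlb_entry(3)
print(mem.access(entry))        # prints True: the entry is fresh
mem.map_page(page=3, frame=9)   # the mapping changes
print(mem.access(entry))        # prints False: the stale entry is rejected
entry = mem.load_tlb_entry(3)   # the processor updates its TLB entry
```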
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy.&amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt; It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into an existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. It improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software-based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from Figure 3, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The effective latency is higher than shown in the graphs because TLB invalidations cause extra cycles to be spent repopulating the TLB, either from the cache or from main memory. The overhead of TLB shootdowns is higher for applications that trigger them more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual to physical address mapping are never modified in the TLBs themselves but rather in the page table entries). There are only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid and thus subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
To match incoming coherence invalidations against TLB entries, UNITD pairs each TLB with a PCAM, a content-addressable memory (CAM) that stores the physical address of the PTE behind each cached translation. To integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence: a translation is tied to the memory block containing its PTE rather than to the PTE itself. Tracking translations at this coarser granularity (cache block rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can reside in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, we refer to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
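&lt;br /&gt;
The insertion and invalidation operations described above can be sketched in a few lines of Python. This is a minimal software model, not the hardware design from the UNITD paper: the class name UnitdTlb, the slot-based layout and the sample addresses are all assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
class UnitdTlb:
    """Toy TLB with a PCAM: slot i of pcam holds the PTE block address
    for the translation held in slot i of the TLB."""

    def __init__(self, size):
        self.pages = [None] * size     # virtual page per slot
        self.frames = [None] * size    # physical frame per slot
        self.valid = [False] * size    # Valid bit: True is Shared, False is Invalid
        self.pcam = [None] * size      # PTE block address per slot

    def insert(self, slot, page, frame, pte_block):
        # The TLB entry and its PCAM entry are written at the same index.
        self.pages[slot] = page
        self.frames[slot] = frame
        self.valid[slot] = True
        self.pcam[slot] = pte_block

    def lookup(self, page):
        # Regular lookups never touch the PCAM.
        for slot, cached in enumerate(self.pages):
            if self.valid[slot] and cached == page:
                return self.frames[slot]
        return None    # TLB miss

    def coherence_invalidate(self, pte_block):
        # Every slot whose PCAM entry matches the block address is invalidated.
        hits = [i for i, addr in enumerate(self.pcam) if addr == pte_block]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits


tlb = UnitdTlb(size=4)
tlb.insert(0, page=100, frame=7, pte_block=12)
tlb.insert(1, page=101, frame=8, pte_block=12)   # PTE in the same cache block
tlb.insert(2, page=200, frame=9, pte_block=40)
print(tlb.coherence_invalidate(12))        # [0, 1]
print(tlb.lookup(100), tlb.lookup(200))    # None 9
```
&lt;br /&gt;
As in Figure 2(b), one invalidation for block address 12 clears both translations whose PTEs share that block, while the unrelated entry stays valid.&lt;br /&gt;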
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. (The ideal-invalidation system uses the same modified OS as UNITD, i.e., with no TLB shootdown support, and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.)&lt;br /&gt;
&lt;br /&gt;
Three results stand out. First, UNITD is efficient in ensuring translation coherence: it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graph below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol. 10, Issue 2, 1990, pp. 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following approaches provides a solution for TLB coherence (choose at least one)-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
6. What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
a) TLB active flag&lt;br /&gt;
b) Hardware instructions&lt;br /&gt;
c) Inter-Processor Interrupts&lt;br /&gt;
d) PMT entries in main memory&lt;br /&gt;
&lt;br /&gt;
7. Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
a) Increase in protection level&lt;br /&gt;
b) Changes to virtual-physical address mappings&lt;br /&gt;
c) Page table reference counts&lt;br /&gt;
d) Presence bit&lt;br /&gt;
&lt;br /&gt;
8. An ASID refers to: (choose one answer)&lt;br /&gt;
a) The unique ID of a process for its lifetime&lt;br /&gt;
b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
c) The unique ID of a PMT entry&lt;br /&gt;
d) The unique ID of a page&lt;br /&gt;
&lt;br /&gt;
9. If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
b) Between PMT entries in main memory and processor registers&lt;br /&gt;
c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
d) Between PMT entries in main memory and the data cache&lt;br /&gt;
&lt;br /&gt;
10. How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
a) Via a shared memory semaphore&lt;br /&gt;
b) Via the interrupt IDC&lt;br /&gt;
c) Via a hardware instruction&lt;br /&gt;
d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60093</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60093"/>
		<updated>2012-03-20T00:26:18Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Hierarchical TLBs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessor architectures, and the hardware-based protocols that solve it, have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that holds the virtual-to-physical mappings of the most recently used pages. The CPU uses the TLB for fast lookup of the physical page frame number for a virtual memory location. The same lookup through the process's page table (a software-based construct) is slower. A page may become invalid because it is swapped out to disk, or its protection level may change (read-only versus read-write). Keeping the TLB entries for the same page on different processors coherent as the page's state changes is called the TLB coherence problem. This coherence is most often maintained by software solutions in the operating system.  In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified, hardware-based cache-TLB coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, its &amp;quot;'''Virtual Memory'''&amp;quot; (one single address space), which is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory works by executing programs that are only partially resident in memory, relying on hardware and the operating system to bring missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to its page table is stored in a register, the '''Page-Table Base Register (PTBR)'''. Changing page tables requires changing only this one register, which substantially reduces context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, a system with 4-byte entries can address 2^44 bytes (16 TB) of physical memory.)&lt;br /&gt;
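&lt;br /&gt;
The arithmetic in the parenthetical above can be checked with a short calculation. This is only a worked example; the variable names are illustrative.&lt;br /&gt;
&lt;br /&gt;
```python
# Worked example: physical memory addressable with 4-byte page-table entries.
ENTRY_BITS = 4 * 8           # a 4-byte entry holds a 32-bit frame number
FRAME_SIZE = 4 * 1024        # 4 KB frames

num_frames = 2 ** ENTRY_BITS             # 2^32 distinct frame numbers
addressable = num_frames * FRAME_SIZE    # bytes of physical memory

print(addressable == 2 ** 44)            # True
print(addressable // 2 ** 40, "TB")      # 16 TB
```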
&lt;br /&gt;
Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit.  The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (a memory access at an address derived from the PTBR) to find the frame number, and only then access the desired location in that frame. Every access thus needs two memory accesses, so memory access is slowed down by a factor of 2.&lt;br /&gt;
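&lt;br /&gt;
The split into page number and offset can be illustrated with a small sketch. The page size of 4 KB, the function names and the sample page table are assumptions chosen for the example.&lt;br /&gt;
&lt;br /&gt;
```python
# Assumed parameter for illustration: 4 KB pages.
PAGE_SIZE = 4096


def split_address(vaddr):
    """Return (page number p, page offset d) for a virtual address."""
    return divmod(vaddr, PAGE_SIZE)


def translate(vaddr, page_table):
    """Look up the frame for page p, then add the offset d."""
    p, d = split_address(vaddr)
    frame = page_table[p]          # first memory access: the page table
    return frame * PAGE_SIZE + d   # address used for the second access


# Hypothetical page table: virtual page 2 lives in physical frame 7.
page_table = {2: 7}
paddr = translate(0x2ABC, page_table)
print(hex(paddr))  # 0x7abc
```
&lt;br /&gt;
The offset (0xABC) passes through unchanged; only the page number is replaced by the frame number.&lt;br /&gt;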
&lt;br /&gt;
[[File:TLB.png‎|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found its frame number is immediately available and is used to access memory. &lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
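&lt;br /&gt;
The hit, miss and replacement behaviour described above can be modelled as follows. This is a toy software model with LRU replacement; the class name, capacity and sample page table are assumptions, and a real TLB performs the comparison in parallel hardware.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class TLB:
    """Toy TLB: maps page number to frame number, evicting the LRU entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # least recently used entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, page, page_table):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)      # mark as most recently used
            return self.entries[page]
        self.misses += 1
        frame = page_table[page]                # slow path: consult the page table
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)    # evict the least recently used
        self.entries[page] = frame
        return frame


page_table = {0: 5, 1: 9, 2: 3}   # hypothetical mappings
tlb = TLB(capacity=2)
for page in [0, 1, 0, 2, 1]:
    tlb.lookup(page, page_table)
print(tlb.hits, tlb.misses)  # 1 4
```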
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame is read-only or read-write (the '''protection level'''). Another bit, the '''valid-invalid''' bit, indicates whether the page belongs to the address space of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the virtual-to-physical mapping is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that L1 caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; design that allows the L1 cache and the TLB to be looked up simultaneously. The virtual index selects the cache set while the TLB translates the page number, and the resulting physical address is compared against the tags in that set. On an L1 miss, the physical address produced by the TLB is used to fetch the block. Performing the two lookups in parallel makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be cached simultaneously in multiple TLBs. (This happens because data is shared across processors and because processes migrate from one processor to another.) Since the contents of these mappings can change in response to changes of state (of the mapping and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out means the corresponding virtual-to-physical page mapping is no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping for an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Likewise, a protection-level change to a mapping must be observed by all processors, and a mapping modification (where the physical page for a virtual address changes) must be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for every datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors: how many processors participate in the TLB update (the victim list), and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins with the initiating processor locking the page table (or some portion thereof).&lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next, followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
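&lt;br /&gt;
The six steps above can be sketched as a small thread-based simulation. This is a user-level model under stated assumptions, not kernel code: threads stand in for processors, starting a thread stands in for delivering an IPI, a barrier stands in for the acknowledgements, and the names Processor, shootdown and handle_ipi are hypothetical. Victims here simply drop the stale entry and refill it on a later miss.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


class Processor:
    def __init__(self, cpu_id, tlb):
        self.cpu_id = cpu_id
        self.tlb = dict(tlb)          # page number to frame number
        self.tlb_active = True


def shootdown(initiator, victims, page_table, table_lock, page, new_frame):
    """Initiator side: steps 1, 2, 3 and 5 of the list above."""
    initiator.tlb_active = False                      # step 1: clear active flag
    acks = threading.Barrier(len(victims) + 1)
    with table_lock:                                  # step 2: lock the page table
        threads = [threading.Thread(target=handle_ipi,
                                    args=(v, page, acks, table_lock))
                   for v in victims]
        for t in threads:                             # step 3: "send" the IPIs
            t.start()
        acks.wait()                                   # wait for every response
        page_table[page] = new_frame                  # step 5: update the PMT
        initiator.tlb[page] = new_frame
        initiator.tlb_active = True                   # set flag before unlocking
    for t in threads:
        t.join()


def handle_ipi(victim, page, acks, table_lock):
    """Victim side: steps 4 and 6."""
    victim.tlb.pop(page, None)       # step 4: invalidate the stale entry
    victim.tlb_active = False
    acks.wait()                      # acknowledge the interrupt
    with table_lock:                 # wait state: blocks until the update is done
        victim.tlb_active = True     # step 6: resume using the TLB


page_table = {4: 10}
lock = threading.Lock()
cpus = [Processor(i, page_table) for i in range(3)]
shootdown(cpus[0], cpus[1:], page_table, lock, page=4, new_frame=11)
print(page_table[4], cpus[0].tlb, cpus[1].tlb)  # 11 {4: 11} {}
```
&lt;br /&gt;
The initiator blocks until every victim has acknowledged, and every victim blocks until the page table is unlocked. That is where the cost of a shootdown grows with the number of processors.&lt;br /&gt;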
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that it is not guaranteed to stay the same for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so the stale entries are never used again. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used on AMD64 processors, for example by the Xen hypervisor on AMD-V. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
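&lt;br /&gt;
The ASID bookkeeping described above can be sketched as follows. The AsidManager class, its method names and the simple incrementing allocator are assumptions made for illustration; a real kernel must also handle running out of ASIDs, which this sketch ignores.&lt;br /&gt;
&lt;br /&gt;
```python
class AsidManager:
    """Hypothetical kernel-side table of ASIDs, indexed by (pid, cpu)."""

    def __init__(self):
        self.asid = {}          # (pid, cpu) to ASID; 0 means "needs a new one"
        self.next_free = {}     # cpu to next unused ASID on that cpu

    def current(self, pid, cpu):
        # Called when pid is dispatched on cpu: replace a zero ASID.
        if self.asid.get((pid, cpu), 0) == 0:
            fresh = self.next_free.get(cpu, 1)
            self.next_free[cpu] = fresh + 1
            self.asid[(pid, cpu)] = fresh
        return self.asid[(pid, cpu)]

    def pte_modified(self, pid, cpu, all_cpus):
        # Zero the ASIDs of pid on every other cpu; TLBs are not touched.
        for other in all_cpus:
            if other != cpu:
                self.asid[(pid, other)] = 0


mgr = AsidManager()
cpus = [0, 1]
old = mgr.current(pid=42, cpu=1)        # process 42 runs on cpu 1
tlb_cpu1 = {(old, 5): 30}               # TLB entries are tagged (ASID, page)
mgr.pte_modified(pid=42, cpu=0, all_cpus=cpus)
new = mgr.current(pid=42, cpu=1)        # next dispatch on cpu 1
print(old, new, (new, 5) in tlb_cpu1)   # 1 2 False
```
&lt;br /&gt;
The stale entry tagged with the old ASID is still physically present in the TLB, but no lookup under the new ASID can match it.&lt;br /&gt;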
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
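&lt;br /&gt;
As a rough illustration of the validation idea, the following Python sketch models a memory that checks generation counts on every access. All names (Memory, Processor, StaleTranslation, map_page, refill) are hypothetical, and the model bumps the generation count only on mapping changes.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical model of validation-based TLB consistency (illustrative names).
class StaleTranslation(Exception):
    pass


class Memory:
    def __init__(self):
        self.generation = {}   # frame maps to its current generation count
        self.page_table = {}   # virtual page maps to frame

    def map_page(self, vpage, frame):
        # Any mapping change bumps the generation count of the frame.
        self.page_table[vpage] = frame
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def access(self, frame, generation):
        # Memory denies requests that carry a stale generation count.
        if self.generation[frame] != generation:
            raise StaleTranslation(frame)
        return 'ok'


class Processor:
    def __init__(self, memory):
        self.memory = memory
        self.tlb = {}          # virtual page maps to (frame, generation)

    def refill(self, vpage):
        frame = self.memory.page_table[vpage]
        self.tlb[vpage] = (frame, self.memory.generation[frame])

    def load(self, vpage):
        if vpage not in self.tlb:
            self.refill(vpage)
        try:
            return self.memory.access(*self.tlb[vpage])
        except StaleTranslation:
            self.refill(vpage)     # update the TLB entry, then retry
            return self.memory.access(*self.tlb[vpage])
```
&lt;br /&gt;
No interrupt is sent when a mapping changes; the stale entry is only discovered, and repaired, when the processor next uses it.&lt;br /&gt;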
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy.&amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt;&lt;br /&gt;
It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach to TLB coherence and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly affect user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from Figure 3, which shows the latency of the processor issuing the shootdown and of the processors receiving it. The effective latency is higher than shown in the graphs because TLB invalidations cost extra cycles to repopulate the TLB, from either the cache or main memory depending on the case. The overhead from TLB shootdowns is higher for applications that invoke the routine more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual to physical address mappings are never modified in the TLBs themselves but rather in the page table entries). There are only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid and thus subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
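In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD adds to each TLB a PTE CAM (PCAM), a content-addressable memory that holds the physical addresses of the PTEs whose translations are cached in the TLB, so that incoming coherence requests (which carry physical addresses) can be matched against TLB entries. UNITD also relaxes the correspondence so that each translation is associated with the memory block containing its PTE, rather than with the PTE itself. Maintaining translation coherence at this coarser granularity (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, PCAM entries are referred to simply as PTE addresses.&lt;br /&gt;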
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
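&lt;br /&gt;
The two PCAM operations can be sketched in Python as follows. This is only an illustrative software model, not the hardware design from the paper: the class name TlbWithPcam, its methods, and the 64-byte block size are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical model of a TLB with a parallel PCAM (illustrative names).
BLOCK_SIZE = 64  # assumed cache block size in bytes


class TlbWithPcam:
    def __init__(self, entries):
        self.tlb = [None] * entries      # (virtual page, frame) per slot
        self.valid = [False] * entries   # Valid bit: Shared or Invalid
        self.pcam = [None] * entries     # block address of the PTE per slot

    def insert(self, index, vpage, frame, pte_addr):
        # The PCAM entry goes in at the same index as the translation.
        self.tlb[index] = (vpage, frame)
        self.valid[index] = True
        self.pcam[index] = pte_addr // BLOCK_SIZE

    def lookup(self, vpage):
        # Regular TLB lookups never consult the PCAM.
        for slot, entry in enumerate(self.tlb):
            if self.valid[slot] and entry[0] == vpage:
                return entry[1]
        return None  # TLB miss

    def coherence_invalidate(self, phys_addr):
        # Clear the Valid bit of every slot whose PTE lives in this block.
        block = phys_addr // BLOCK_SIZE
        hits = [i for i, b in enumerate(self.pcam) if b == block]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits
```
&lt;br /&gt;
A single coherence invalidation may clear several translations at once, which is the cost of tracking PTEs at cache-block granularity.&lt;br /&gt;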
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. (The ideal-invalidation system uses the same modified OS as UNITD, i.e., with no TLB shootdown support, and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact).&lt;br /&gt;
&lt;br /&gt;
Three observations follow. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graph below).&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol10, Issue-2, 1990, Pages 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following approaches provides a solution for TLB coherence (choose at least one)-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
6. What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
a) TLB active flag&lt;br /&gt;
b) Hardware instructions&lt;br /&gt;
c) Inter-Processor Interrupts&lt;br /&gt;
d) PMT entries in main memory&lt;br /&gt;
&lt;br /&gt;
7. Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
a) Increase in protection level; &lt;br /&gt;
b) Changes to virtual-physical address mappings.&lt;br /&gt;
c) Page table reference counts&lt;br /&gt;
d) Presence bit&lt;br /&gt;
&lt;br /&gt;
8. An ASID refers to: (choose one answer)&lt;br /&gt;
a) The unique ID of a process for its lifetime&lt;br /&gt;
b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
c) The unique ID of a PMT entry&lt;br /&gt;
d) The unique ID of a page&lt;br /&gt;
&lt;br /&gt;
9. If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
b) Between PMT entries in main memory and processor registers&lt;br /&gt;
c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
d) Between PMT entries in main memory and the data cache&lt;br /&gt;
&lt;br /&gt;
10. How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
a) Via a shared memory semaphore&lt;br /&gt;
b) Via the interrupt IDC&lt;br /&gt;
c) Via a hardware instruction&lt;br /&gt;
d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60092</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60092"/>
		<updated>2012-03-20T00:25:59Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* TLB Shootdown */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessing architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that maintains the virtual-to-physical mappings of the most recently used pages. The CPU uses the TLB for fast lookup of the physical page frame number for a virtual memory location. The same lookup from the process's page table (a software-based construct) is slower. A page may become invalid because it is swapped out to backing storage, or its protection level may change (read-only vs. read-write). Maintaining coherence between TLB entries on different processors for the same page (as its state changes) is called the TLB coherence problem. This coherence is most often maintained by software-based solutions in the operating system.  In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified cache-TLB coherence protocol (that is hardware based).&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed its pages are loaded into any available memory frames from the backing storage (for example-a hard drive).   &lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For a process, a pointer to the page table ('''Page-Table Base Register-PTBR''') is stored in a register. Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page table with entries that are 4 bytes (32 bits) long can point to 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory).&lt;br /&gt;
&lt;br /&gt;
Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit.  The problem with this approach is the time required to access a user memory location. If we want to access a location, we must first index into the page table (which requires a memory access using the PTBR), find the frame number from the page table and only then access the required location in that frame. Memory access is thus slowed down by a factor of 2. &lt;br /&gt;
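&lt;br /&gt;
The address split and the capacity arithmetic above can be checked with a short Python sketch. The page table contents and the example address are made up for illustration, and the names translate, PAGE_SIZE and ADDRESSABLE_BYTES are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Minimal sketch of paged address translation (all values are made up).
PAGE_SIZE = 4096                          # 4 KB pages
OFFSET_BITS = PAGE_SIZE.bit_length() - 1  # 12 offset bits for 4 KB pages

# 2^32 frames of 4 KB each give 2^44 bytes of addressable physical memory.
ADDRESSABLE_BYTES = 2**32 * PAGE_SIZE


def translate(virtual_addr, page_table):
    # Split the address into page number p and page offset d.
    p, d = divmod(virtual_addr, PAGE_SIZE)
    frame = page_table[p]                 # the extra memory access
    return frame * PAGE_SIZE + d


page_table = {0: 5, 1: 2}                 # page number maps to frame number
physical = translate(0x1ABC, page_table)  # page 1, offset 0xABC
```
&lt;br /&gt;
With this table, virtual address 0x1ABC lies in page 1 at offset 0xABC, so it maps to frame 2 and physical address 0x2ABC.&lt;br /&gt;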
&lt;br /&gt;
[[File:TLB.png‎|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024. &lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found its frame number is immediately available and is used to access memory. &lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
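&lt;br /&gt;
The hit and miss paths described above can be sketched in Python as a small software model. The class name Tlb and its fields are hypothetical, and LRU is chosen here only as one of the replacement policies mentioned.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


# Hypothetical software model of a TLB with LRU replacement.
class Tlb:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # page number maps to frame number
        self.misses = 0

    def lookup(self, page, page_table):
        if page in self.entries:
            self.entries.move_to_end(page)      # mark as most recently used
            return self.entries[page]
        self.misses += 1                        # TLB miss: use the page table
        frame = page_table[page]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)    # evict least recently used
        self.entries[page] = frame
        return frame
```
&lt;br /&gt;
A real TLB compares all tags in parallel in hardware; the dictionary lookup here only stands in for that associative search.&lt;br /&gt;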
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to the entry in the next-level page table and so on until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; solution that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the TLB look-up of the address is used to fetch the block. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another). Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes a) swap outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table). If a TLB holds a mapping for an invalidated Page Table Entry (PTE), then the mapping must be flushed from all TLBs and must not be used. Further, protection-level changes for a TLB mapping need to be observed by all processors. Mapping modifications for a TLB entry (where the physical mapping changes for a virtual address) also need to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review  UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB and cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) store various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors, including how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
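&lt;br /&gt;
The six steps above can be walked through with a sequential Python sketch. It is a simplification under stated assumptions: there is no real concurrency, the page-table lock and the IPIs are modeled only as log entries, victims reload their TLBs lazily on their next miss, and interrupts are re-enabled at the end (a step the list above does not spell out). The names Processor, shootdown and log are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sequential walk-through of a TLB shootdown (no concurrency).
class Processor:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}                   # virtual page maps to frame
        self.tlb_active = True
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, vpage, new_frame, log):
    # Step 1: the initiator disables interrupts and clears its active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: lock the page table (modeled as a log entry only).
    log.append('lock')
    # Steps 3 and 4: each victim services the IPI, invalidates, then waits.
    for cpu in victims:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(vpage, None)
        cpu.tlb_active = False
        log.append('ipi-ack %d' % cpu.cpu_id)
    # Step 5: update the page table and the initiator's own TLB.
    page_table[vpage] = new_frame
    initiator.tlb[vpage] = new_frame
    initiator.tlb_active = True
    log.append('unlock')
    initiator.interrupts_enabled = True  # assumed; not stated in the steps
    # Step 6: victims leave the wait state and resume using their TLBs.
    for cpu in victims:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True    # assumed; not stated in the steps
```
&lt;br /&gt;
The sketch makes the cost visible: every victim is interrupted and stalled for the whole update, which is why the penalty grows with the number of processors.&lt;br /&gt;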
&lt;br /&gt;
'''Implementations'''&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an address space identifier (ASID). The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor; hence, a single PID may correspond to many ASIDs. A separate array keeps track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’s ASIDs to 0 on the other processors. Note that this does not modify the ASIDs in the TLB entries, only the array that maps a process to its per-processor ASID.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns the process a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so those entries can no longer be hit and are effectively invalidated. The IRIX 5.2 operating system uses this mechanism on MIPS, and a similar ASID scheme is used by Xen on AMD64 (AMD-V) hardware. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
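&lt;br /&gt;
The ASID bookkeeping described above can be sketched as a small Python model. This is only an illustration under stated assumptions: the Kernel and Cpu classes, the asid_of table and the pte_modified method are hypothetical names, and ASID exhaustion and wrap-around are ignored.&lt;br /&gt;
&lt;br /&gt;
```python
"""Hypothetical model of ASID-based lazy TLB invalidation (illustrative names)."""


class Cpu:
    def __init__(self):
        self.next_asid = 1   # 0 is reserved to mean "no valid ASID"
        self.tlb = {}        # (asid, virtual_page) maps to frame

    def fresh_asid(self):
        asid = self.next_asid
        self.next_asid += 1
        return asid


class Kernel:
    def __init__(self, num_cpus):
        self.cpus = [Cpu() for _ in range(num_cpus)]
        self.asid_of = {}    # (pid, cpu_id) maps to ASID: the separate array

    def run(self, pid, cpu_id):
        """Return the ASID of pid on cpu_id, assigning a new one if it is zero."""
        key = (pid, cpu_id)
        if self.asid_of.get(key, 0) == 0:
            self.asid_of[key] = self.cpus[cpu_id].fresh_asid()
        return self.asid_of[key]

    def pte_modified(self, pid, initiator):
        """Zero the ASID of pid on every other processor; TLBs are untouched."""
        for cpu_id in range(len(self.cpus)):
            if cpu_id != initiator:
                self.asid_of[(pid, cpu_id)] = 0

    def lookup(self, pid, cpu_id, virtual_page):
        """TLB lookup tagged with the current ASID; None means a miss."""
        asid = self.run(pid, cpu_id)
        return self.cpus[cpu_id].tlb.get((asid, virtual_page))


kernel = Kernel(num_cpus=2)
asid = kernel.run(pid=7, cpu_id=1)
kernel.cpus[1].tlb[(asid, 16)] = 160      # translation cached on CPU 1
kernel.pte_modified(pid=7, initiator=0)   # CPU 0 changes the PTE
print(kernel.lookup(7, 1, 16))            # None: the stale entry no longer matches
```
&lt;br /&gt;
In this model the stale entry stays in the TLB dictionary; it simply can never match again because the process now runs under a different ASID on that processor.&lt;br /&gt;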
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
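&lt;br /&gt;
The generation-count check can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the mechanism of any particular machine: the Memory and Processor classes and the StaleTranslation exception are hypothetical names, and only mapping changes (not permission changes) are modelled.&lt;br /&gt;
&lt;br /&gt;
```python
"""Hypothetical sketch of validation-based TLB consistency (not a real API)."""


class StaleTranslation(Exception):
    """Raised when a request carries an outdated generation count."""


class Memory:
    def __init__(self):
        self.page_table = {}   # virtual_page maps to frame
        self.generation = {}   # frame maps to its current generation count

    def map_page(self, virtual_page, frame):
        """Install or change a mapping and bump the affected generation counts."""
        old = self.page_table.get(virtual_page)
        if old is not None and old != frame:
            self.generation[old] = self.generation.get(old, 0) + 1
        self.page_table[virtual_page] = frame
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def access(self, frame, generation):
        """Validate the generation count of the requester at access time."""
        if self.generation.get(frame) != generation:
            raise StaleTranslation(frame)
        return "ok"


class Processor:
    def __init__(self, memory):
        self.memory = memory
        self.tlb = {}          # virtual_page maps to (frame, generation)

    def refill(self, virtual_page):
        frame = self.memory.page_table[virtual_page]
        self.tlb[virtual_page] = (frame, self.memory.generation[frame])

    def load(self, virtual_page):
        if virtual_page not in self.tlb:
            self.refill(virtual_page)
        try:
            return self.memory.access(*self.tlb[virtual_page])
        except StaleTranslation:
            self.refill(virtual_page)   # update the TLB entry, then retry
            return self.memory.access(*self.tlb[virtual_page])
```
&lt;br /&gt;
No interrupt is sent when map_page changes a mapping; a processor holding the old translation only discovers the change on its next access, which is the deferral the approach relies on.&lt;br /&gt;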
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin, and Bracy. &amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt; UNITD is an example of a hardware-based unified protocol for cache and TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. The TLBs participate in cache coherence updates without requiring any change to that protocol. UNITD improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly affect application performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty of a shootdown increases with the number of processors, as can be seen in Figure 3, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The effective latency is higher than the graphs show, because TLB invalidations cost extra cycles to repopulate the TLB from either the cache or main memory. The overhead of TLB shootdowns is higher for applications that trigger them more often (see Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (for example, a typical MOSI protocol with Modified, Owned, Shared, and Invalid states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
TLB entries are read-only: virtual-to-physical address mappings are never modified in the TLBs themselves but rather in the page table entries. There are therefore only two coherence states, Shared (read-only) and Invalid. UNITD uses the Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating it. The translation is then Invalid, so subsequent memory accesses that depend on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
To integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD pairs each TLB with a content-addressable memory, the PCAM (PTE CAM), that records the physical address of the PTE behind each cached translation. UNITD also relaxes the correspondence so that a translation is tied to the memory block containing the PTE rather than to the PTE itself. Maintaining translation coherence at this coarser granularity (cache block rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, PCAM entries are referred to simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
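&lt;br /&gt;
The two PCAM operations described above (insertion alongside the TLB entry, and invalidation on a coherence request) can be modelled in a few lines of Python. This is a minimal sketch, not the hardware design from the paper: the CoherentTlb class, its method names and the 64-byte block size are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
"""Hypothetical model of a TLB paired with a PCAM (illustrative names only)."""

BLOCK_SIZE = 64  # assumed cache block size in bytes


class CoherentTlb:
    def __init__(self, num_entries):
        self.tlb = [None] * num_entries      # (virtual_page, frame) per slot
        self.valid = [False] * num_entries   # Valid bit: Shared if set
        self.pcam = [None] * num_entries     # PTE block address per slot

    def insert(self, index, virtual_page, frame, pte_address):
        """Insert a translation and, at the same index, its PTE block address."""
        self.tlb[index] = (virtual_page, frame)
        self.pcam[index] = pte_address // BLOCK_SIZE
        self.valid[index] = True

    def evict(self, index):
        """Replacing a translation also removes its PCAM entry."""
        self.tlb[index] = None
        self.pcam[index] = None
        self.valid[index] = False

    def coherence_invalidate(self, physical_address):
        """Clear the Valid bit of every entry whose PTE lies in the written block."""
        block = physical_address // BLOCK_SIZE
        hits = [i for i, b in enumerate(self.pcam) if b == block]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits

    def lookup(self, virtual_page):
        """Regular TLB lookup; the PCAM is not consulted on this path."""
        for i, entry in enumerate(self.tlb):
            if self.valid[i] and entry is not None and entry[0] == virtual_page:
                return entry[1]
        return None  # miss: the translation must be reacquired


tlb = CoherentTlb(num_entries=4)
tlb.insert(0, virtual_page=1, frame=100, pte_address=768)   # block 12
tlb.insert(2, virtual_page=2, frame=101, pte_address=776)   # block 12 as well
tlb.insert(3, virtual_page=3, frame=102, pte_address=1024)  # block 16
print(tlb.coherence_invalidate(770))   # [0, 2]: both entries share block 12
```
&lt;br /&gt;
A single coherence request for block 12 invalidates two translations here, which mirrors the case in which several PTEs reside in the same cache block.&lt;br /&gt;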
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB; this validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
UNITD is efficient in ensuring translation coherence. First, it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns (see Figure 5 below).&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol10, Issue-2, 1990, Pages 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following approaches provides a solution for TLB coherence (choose at least one)-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
6. What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
a) TLB active flag&lt;br /&gt;
b) Hardware instructions&lt;br /&gt;
c) Inter-Processor Interrupts&lt;br /&gt;
d) PMT entries in main memory&lt;br /&gt;
&lt;br /&gt;
7. Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
a) Increase in protection level&lt;br /&gt;
b) Changes to virtual-physical address mappings&lt;br /&gt;
c) Page table reference counts&lt;br /&gt;
d) Presence bit&lt;br /&gt;
&lt;br /&gt;
8. An ASID refers to: (choose one answer)&lt;br /&gt;
a) The unique ID of a process for its lifetime&lt;br /&gt;
b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
c) The unique ID of a PMT entry&lt;br /&gt;
d) The unique ID of a page&lt;br /&gt;
&lt;br /&gt;
9. If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
b) Between PMT entries in main memory and processor registers&lt;br /&gt;
c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
d) Between PMT entries in main memory and the data cache&lt;br /&gt;
&lt;br /&gt;
10. How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
a) Via a shared memory semaphore&lt;br /&gt;
b) Via the interrupt IDC&lt;br /&gt;
c) Via a hardware instruction&lt;br /&gt;
d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60091</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60091"/>
		<updated>2012-03-20T00:23:45Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Background - Virtual Memory, Paging and TLB */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessing architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that maintains the virtual-to-physical mappings of the most recently used pages. The CPU uses the TLB for fast lookup of the physical page frame number for a virtual memory location. The same lookup from the process's page table (a software-based construct) is slower. A page may become invalid because it is swapped out of main memory, or its protection level may change (read versus read-write). Maintaining coherence between TLB entries on different processors for the same page (as its state changes) is called the TLB coherence problem. This coherence is most often maintained by software-based solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified, hardware-based cache-TLB coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to the page table is stored in a register, the '''Page-Table Base Register (PTBR)'''. Changing page tables requires changing only this one register, which substantially reduces the context-switch time. (A page table whose entries are 4 bytes (32 bits) long can point to 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), such a system can address 2^44 bytes (16 TB) of physical memory.)&lt;br /&gt;
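&lt;br /&gt;
The addressing arithmetic in the parenthetical note above can be checked with a few lines of Python. The function name is hypothetical, and the calculation assumes that every bit of a 32-bit entry is used for the frame number, which real page table entries do not do because some bits hold flags.&lt;br /&gt;
&lt;br /&gt;
```python
"""Back-of-the-envelope check; assumes every entry bit names a frame."""


def addressable_bytes(entry_bits, frame_size):
    """Bytes reachable when each entry can name 2**entry_bits frames."""
    return (2 ** entry_bits) * frame_size


total = addressable_bytes(entry_bits=32, frame_size=4 * 1024)
print(total == 2 ** 44)         # True
print(total // 2 ** 40, "TB")   # 16 TB
```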
&lt;br /&gt;
Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the PTBR), read the frame number from the page table entry, and only then access the required location in that frame. Memory access is therefore slowed down by a factor of 2.&lt;br /&gt;
&lt;br /&gt;
[[File:TLB.png|400px|thumb|right|Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found its frame number is immediately available and is used to access memory. &lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
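&lt;br /&gt;
The lookup sequence described above (split the address, probe the TLB, fall back to the page table on a miss, and replace an entry when the TLB is full) can be sketched as follows. This is a simplified software model under stated assumptions: the Tlb class is hypothetical, the page size is taken to be 4,096 bytes, the page table is a plain dictionary, and random replacement is used.&lt;br /&gt;
&lt;br /&gt;
```python
"""Hypothetical software model of address translation through a TLB."""

import random

PAGE_SIZE = 4096  # assumed page size in bytes


class Tlb:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}   # page number maps to frame number

    def translate(self, address, page_table):
        """Return (physical_address, hit) for a virtual address."""
        page, offset = divmod(address, PAGE_SIZE)
        hit = page in self.entries
        if not hit:
            frame = page_table[page]   # extra memory reference on a miss
            if len(self.entries) == self.capacity:
                victim = random.choice(list(self.entries))  # random replacement
                del self.entries[victim]
            self.entries[page] = frame
        return self.entries[page] * PAGE_SIZE + offset, hit


page_table = {0: 5, 1: 9, 2: 3}
tlb = Tlb(capacity=2)
print(tlb.translate(4100, page_table))   # (36868, False): miss, then filled
print(tlb.translate(4100, page_table))   # (36868, True): hit on the next reference
```
&lt;br /&gt;
Virtual address 4100 lies in page 1 at offset 4, and page 1 maps to frame 9, so the physical address is 9 * 4096 + 4 = 36868.&lt;br /&gt;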
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; scheme that enables simultaneous lookup of both the L1 cache and the TLB. If the block is not found in the L1 cache, the address obtained from the TLB lookup is used to fetch it. Simultaneous lookup of the L1 cache and the TLB makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because data is shared across processors and because processes migrate from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection level changes for a mapping need to be observed by all processors. A mapping modification (where the physical page for a virtual address changes) also needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) store various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. Its variants turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method relies primarily on software to invalidate TLB entries. The initiating processor places an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request with its interrupt controller. The interrupt is dispatched through a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program; the request also identifies the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries, or the entire TLB (depending on the implementation), using the shootdown program. Each receiving processor then clears its TLB active flag and enters a wait state after servicing the interrupt. The wait state may be implemented by attempting to acquire the same page table lock held by the initiating processor or by polling shared memory.&lt;br /&gt;
# Once all recipients have serviced the interrupt, the initiating processor makes its changes to the PMTs and updates its own TLB cache. Before releasing the page table lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state, update their TLB caches, and set their TLB active flags.&lt;br /&gt;
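&lt;br /&gt;
The six steps above can be sketched as a small, single-threaded Python simulation. It is an illustration under stated assumptions, not any kernel's implementation: the Processor class, the shootdown function, the dictionary-based page table and the log list are all hypothetical names. The IPI is modeled as a plain loop over the victims, the page table lock as a flag, and victims refill their TLBs lazily on the next miss rather than eagerly in step 6.&lt;br /&gt;

```python
class Processor:
    """One CPU with a private TLB (virtual page to frame) and two flags."""

    def __init__(self, name):
        self.name = name
        self.tlb = {}
        self.tlb_active = True
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, vpage, new_frame, log):
    """Remap vpage to new_frame, following steps 1-6 of the text (hypothetical)."""
    # Step 1: the initiator disables local interrupts and clears its flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: lock the page table (a flag stands in for a real lock).
    page_table["locked"] = True
    log.append("lock")
    # Steps 3-4: deliver the "IPI"; each victim invalidates, then waits.
    for cpu in victims:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(vpage, None)      # invalidate the affected entry only
        cpu.tlb_active = False
        log.append("ack " + cpu.name)
    # Step 5: every victim has responded, so the PMT can change safely.
    page_table["entries"][vpage] = new_frame
    initiator.tlb[vpage] = new_frame
    initiator.tlb_active = True
    page_table["locked"] = False
    initiator.interrupts_enabled = True
    log.append("unlock")
    # Step 6: victims leave the wait state (TLB refill happens on next miss).
    for cpu in victims:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True


cpus = [Processor("cpu0"), Processor("cpu1"), Processor("cpu2")]
page_table = {"locked": False, "entries": {7: 100}}
for cpu in cpus:
    cpu.tlb[7] = 100                  # every TLB caches the old mapping
events = []
shootdown(cpus[0], cpus[1:], page_table, 7, 200, events)
```

After the call, the page table and the initiator's TLB hold the new frame, while the victims no longer cache the page at all. The recorded event order shows that the page table is unlocked only after every victim has acknowledged.&lt;br /&gt;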
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows operating systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that it is not guaranteed to remain the same for the life of a process. Each process is assigned a unique ASID on each processor; hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’s ASIDs to 0 on the other processors. Note: this does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so those entries can no longer be used and the translations must be reloaded. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used on AMD64 hardware, for example by Xen on AMD-V. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
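&lt;br /&gt;
The ASID bookkeeping described above can be illustrated with a minimal Python sketch. The AsidAllocator class and its method names are hypothetical, the TLB is modeled as a dictionary keyed by (ASID, virtual page), and ASID exhaustion and recycling are deliberately ignored.&lt;br /&gt;

```python
class AsidAllocator:
    """Tracks the ASID of each (process, processor) pair; 0 means stale."""

    def __init__(self, num_cpus):
        self.next_asid = [1] * num_cpus   # next unused ASID on each CPU
        self.asid_of = {}                 # (pid, cpu) to ASID

    def on_pte_change(self, pid, initiating_cpu):
        # Zero the process's ASID on every processor except the initiator.
        # Only this array changes; the TLB entries themselves are untouched.
        for (p, cpu) in list(self.asid_of):
            if p == pid and cpu != initiating_cpu:
                self.asid_of[(p, cpu)] = 0

    def asid_for(self, pid, cpu):
        # Called when the process is about to run on cpu.
        if self.asid_of.get((pid, cpu), 0) == 0:
            self.asid_of[(pid, cpu)] = self.next_asid[cpu]
            self.next_asid[cpu] += 1
        return self.asid_of[(pid, cpu)]


alloc = AsidAllocator(num_cpus=2)
asid_cpu0 = alloc.asid_for(pid=42, cpu=0)
asid_cpu1 = alloc.asid_for(pid=42, cpu=1)
tlb_cpu1 = {(asid_cpu1, 7): 100}          # cpu1 caches page 7 of process 42

alloc.on_pte_change(pid=42, initiating_cpu=0)
new_asid_cpu1 = alloc.asid_for(pid=42, cpu=1)
stale_entry_matches = (new_asid_cpu1, 7) in tlb_cpu1
```

Because the process receives a fresh ASID on the second processor, the old entry stays in that TLB but can no longer match a lookup. No interrupt is needed to achieve this.&lt;br /&gt;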
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
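&lt;br /&gt;
A minimal Python sketch of the generation-count check follows. The Memory class, the remap and access methods and the load helper are hypothetical names chosen for illustration. The sketch assumes that a remapping bumps the count of both the old and the new frame, and that a denied request is retried after the TLB entry is reloaded.&lt;br /&gt;

```python
class Memory:
    """Memory-side state: page table plus one generation count per frame."""

    def __init__(self):
        self.page_table = {}   # virtual page to frame
        self.generation = {}   # frame to current generation count

    def _bump(self, frame):
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def remap(self, vpage, frame):
        old = self.page_table.get(vpage)
        if old is not None:
            self._bump(old)    # invalidates TLB entries that still name old
        self._bump(frame)
        self.page_table[vpage] = frame

    def access(self, frame, generation):
        # The request is honored only if it carries the current count.
        return self.generation.get(frame) == generation


def load(tlb, memory, vpage):
    """Return (frame, retries); the TLB is validated lazily at access time."""
    retries = 0
    while True:
        if vpage not in tlb:
            frame = memory.page_table[vpage]
            tlb[vpage] = (frame, memory.generation[frame])
        frame, generation = tlb[vpage]
        if memory.access(frame, generation):
            return frame, retries
        del tlb[vpage]         # request denied: drop the stale entry
        retries += 1


memory = Memory()
memory.remap(7, 100)
tlb = {}
first = load(tlb, memory, 7)
memory.remap(7, 200)           # no interrupt is sent to the TLB's owner
second = load(tlb, memory, 7)
```

The first load succeeds without a retry. The second load is denied once, because the cached entry still names frame 100 with an outdated count, and it succeeds after the entry is reloaded.&lt;br /&gt;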
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin, and Bracy.&amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt; It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. The TLBs participate in cache coherence updates without requiring any change to that protocol. UNITD improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) significantly improve TLB coherence performance for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for shootdown increases with the number of processors, as can be seen from Figure 3, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The effective latency is higher than shown in the graphs because TLB invalidations cost extra cycles to repopulate the TLB, either from the cache or from main memory depending on the case. The overhead from TLB shootdowns is higher for applications that invoke the routine more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared, and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual-to-physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, so subsequent memory accesses that depend on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
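Coherence requests carry physical addresses, whereas a TLB is looked up by virtual page. UNITD therefore pairs each TLB with the PCAM (PTE CAM), a content-addressable memory that records, for each cached translation, the physical address of the PTE it was loaded from. In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence so that each translation is associated with the memory block containing the PTE rather than with the PTE itself. Maintaining translation coherence at this coarser granularity (i.e., cache block rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can reside in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, PCAM entries are referred to here simply as PTE addresses.&lt;br /&gt;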
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
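&lt;br /&gt;
The two PCAM operations described above (insertion alongside the TLB entry, and invalidation on a coherence request) can be sketched in Python as follows. This is a simplified software model, not the hardware design from the paper: the UnitdTlb class and its method names are hypothetical, a 64-byte cache block and 8-byte PTEs are assumed, and a linear scan stands in for the parallel match a real CAM would perform.&lt;br /&gt;

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed for illustration)


class UnitdTlb:
    """A TLB with a Valid bit per entry and a PCAM of the same shape."""

    def __init__(self, num_entries):
        self.entries = [None] * num_entries   # (vpage, frame) or None
        self.valid = [False] * num_entries    # True = Shared, False = Invalid
        self.pcam = [None] * num_entries      # block address of each PTE

    def insert(self, index, vpage, frame, pte_addr):
        # The PTE block address goes into the PCAM at the same index
        # as the translation goes into the TLB.
        self.entries[index] = (vpage, frame)
        self.valid[index] = True
        self.pcam[index] = pte_addr // BLOCK_SIZE

    def lookup(self, vpage):
        # Regular lookups never touch the PCAM.
        for i, entry in enumerate(self.entries):
            if self.valid[i] and entry[0] == vpage:
                return entry[1]
        return None                           # TLB miss

    def coherence_invalidate(self, addr):
        # An incoming invalidation for a block clears every matching entry.
        block = addr // BLOCK_SIZE
        hits = [i for i, b in enumerate(self.pcam) if b == block]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits


tlb = UnitdTlb(num_entries=4)
tlb.insert(0, vpage=1, frame=10, pte_addr=0x1000)
tlb.insert(2, vpage=2, frame=20, pte_addr=0x1008)  # same block as entry 0
tlb.insert(3, vpage=3, frame=30, pte_addr=0x2000)  # a different block
invalidated = tlb.coherence_invalidate(0x1010)     # a write within the block
```

In this run, the single invalidation hits two PCAM entries because both PTEs live in the same cache block. This is the coarser-grained behavior the text describes, and the translation whose PTE sits in another block is unaffected.&lt;br /&gt;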
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
Three results stand out. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graph below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol10, Issue-2, 1990, Pages 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following approaches provides a solution for TLB coherence (choose at least one)-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
6. What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
a) TLB active flag&lt;br /&gt;
b) Hardware instructions&lt;br /&gt;
c) Inter-Processor Interrupts&lt;br /&gt;
d) PMT entries in main memory&lt;br /&gt;
&lt;br /&gt;
7. Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
a) Increase in protection level; &lt;br /&gt;
b) Changes to virtual-physical address mappings.&lt;br /&gt;
c) Page table reference counts&lt;br /&gt;
d) Presence bit&lt;br /&gt;
&lt;br /&gt;
8. An ASID refers to: (choose one answer)&lt;br /&gt;
a) The unique ID of a process for its lifetime&lt;br /&gt;
b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
c) The unique ID of a PMT entry&lt;br /&gt;
d) The unique ID of a page&lt;br /&gt;
&lt;br /&gt;
9. If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
b) Between PMT entries in main memory and processor registers&lt;br /&gt;
c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
d) Between PMT entries in main memory and the data cache&lt;br /&gt;
&lt;br /&gt;
10. How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
a) Via a shared memory semaphore&lt;br /&gt;
b) Via the interrupt IDC&lt;br /&gt;
c) Via a hardware instruction&lt;br /&gt;
d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60090</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60090"/>
		<updated>2012-03-20T00:23:04Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Background - Virtual Memory, Paging and TLB */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessing architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that maintains the virtual-to-physical mappings of the most recently used pages. The CPU uses the TLB for fast lookup of the physical page frame number for a virtual memory location. The same lookup from the process's page table (a software-based construct) is slower. A page may become invalid because it is swapped out of main memory, or its protection level may change (read versus read-write). Maintaining coherence between TLB entries on different processors for the same page (as its state changes) is called the TLB coherence problem. This coherence is most often maintained by software-based solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence and c) a proposal for a unified cache-TLB coherence protocol (that is hardware based).&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct, managed by the operating system, that is kept in main memory. For each process, a pointer to the page table is stored in a register (the '''Page-Table Base Register, PTBR'''). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page table whose entries are 4 bytes (32 bits) long can point to 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), a system with 4-byte entries can address 2^32 frames of 2^12 bytes each, or 2^44 bytes (16 TB), of physical memory.)&lt;br /&gt;
&lt;br /&gt;
Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (located through the PTBR, which requires a memory access) to find the frame number, and only then access the desired location in that frame. Two memory accesses are needed for each reference, so memory access is slowed down by a factor of 2.&lt;br /&gt;
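&lt;br /&gt;
The split into page number and offset can be made concrete with a short Python sketch. It assumes 4 KB pages and models the page table as a dictionary; the translate function and the sample mapping are hypothetical.&lt;br /&gt;

```python
PAGE_SIZE = 4096  # 4 KB pages, so the offset occupies the low 12 bits


def translate(vaddr, page_table):
    """Translate a virtual address using a page table (vpage to frame)."""
    page_number, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table[page_number]    # the extra memory access
    return frame * PAGE_SIZE + offset  # frame base combined with offset


page_table = {3: 7}                    # virtual page 3 lives in frame 7
paddr = translate(0x3ABC, page_table)

# With 32-bit entries, 2**32 frames of 4 KB each are addressable.
addressable_bytes = 2**32 * PAGE_SIZE
```

Here virtual address 0x3ABC lies in page 3 at offset 0xABC, so it maps to physical address 0x7ABC. The last line reproduces the 2^44-byte figure quoted for 4-byte page table entries.&lt;br /&gt;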
&lt;br /&gt;
[[File:TLBFigure811.png‎|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found its frame number is immediately available and is used to access memory. &lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
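&lt;br /&gt;
The hit and miss paths described above can be sketched in Python as follows. The Tlb class is a hypothetical software model that assumes LRU replacement (one of the policies mentioned); a real TLB compares all keys in parallel in hardware.&lt;br /&gt;

```python
from collections import OrderedDict


class Tlb:
    """A small TLB model with LRU replacement (page number to frame)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # least recently used entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number, page_table):
        if page_number in self.entries:
            self.hits += 1
            self.entries.move_to_end(page_number)   # mark as recently used
            return self.entries[page_number]
        # TLB miss: consult the page table in memory, then cache the result.
        self.misses += 1
        frame = page_table[page_number]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)        # evict the LRU entry
        self.entries[page_number] = frame
        return frame


page_table = {1: 10, 2: 20, 3: 30}
tlb = Tlb(capacity=2)
frames = [tlb.lookup(p, page_table) for p in (1, 2, 1, 3)]
```

With a capacity of two entries, the reference sequence 1, 2, 1, 3 produces three misses and one hit. The final miss evicts page 2, because page 1 was used more recently.&lt;br /&gt;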
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to the entry in the next-level page table and so on until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that enables simultaneous lookup of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB lookup is used to fetch it. Looking up the L1 cache and the TLB simultaneously makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection-level changes for a TLB mapping need to be observed by all processors. Likewise, a mapping modification for a TLB entry (where the physical mapping changes for a virtual address) needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not maintain TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review  UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB and cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
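&lt;br /&gt;
The safe/unsafe distinction described above can be sketched as a small predicate. The Python below is a minimal illustration only: the PmtEntry fields and the is_safe_update name are hypothetical and do not correspond to any real kernel structure. It treats an update as unsafe when a resident page leaves memory, changes frame, or has its protection increased.&lt;br /&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PmtEntry:
    """Illustrative PMT entry; the field names are assumptions, not a real layout."""
    frame: int
    protection: str  # "read-only" or "read-write"
    present: bool


def is_safe_update(old, new):
    """Return True if other TLBs may keep a stale copy of old for now."""
    if old.present and not new.present:
        return False  # page left main memory: cached translations must go
    if old.present and new.frame != old.frame:
        return False  # mapping changed: stale entries point at the wrong frame
    if (old.protection, new.protection) == ("read-write", "read-only"):
        return False  # protection increased: a stale entry would still allow writes
    return True  # e.g. protection reduced, or presence bit set
```

Under these assumptions, relaxing a resident page from read-only to read-write is reported as safe, while the reverse change or a remapping to another frame is reported as unsafe.&lt;br /&gt;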
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
=== Virtually Indexed Caches ===&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
=== TLB Shootdown ===&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on two factors: how many processors participate in the TLB update (the victim list), and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt is dispatched through a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program, and the request identifies the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page table lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
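&lt;br /&gt;
The six steps above can be walked through in a sequential simulation. This is a sketch under stated assumptions, not an operating system implementation: the Cpu class, the lock dictionary and the shootdown and translate functions are hypothetical names, interrupts and busy-waiting are modeled as plain flag changes, and victims refill their TLB lazily on the next miss (one possible reading of step 6).&lt;br /&gt;

```python
class Cpu:
    """One processor with a private TLB; a hypothetical model for illustration."""

    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}                   # virtual page number to frame number
        self.tlb_active = True          # is this CPU currently using its TLB?
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, lock, vpage, new_frame):
    # Step 1: the initiator disables interrupts and clears its TLB active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: the initiator locks the page table (modeled as an owner field).
    assert lock["owner"] is None, "page table already locked"
    lock["owner"] = initiator.cpu_id
    # Steps 3 and 4: each victim services the IPI, invalidates the stale
    # entry, clears its active flag, and waits for the lock to be released.
    for cpu in victims:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(vpage, None)
        cpu.tlb_active = False
    # Step 5: all victims have responded, so the PMT can be changed safely.
    page_table[vpage] = new_frame
    initiator.tlb[vpage] = new_frame
    initiator.tlb_active = True
    lock["owner"] = None
    initiator.interrupts_enabled = True
    # Step 6: victims leave the wait state and resume using their TLBs.
    for cpu in victims:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True


def translate(cpu, page_table, vpage):
    """TLB lookup that falls back to the page table on a miss."""
    if vpage not in cpu.tlb:
        cpu.tlb[vpage] = page_table[vpage]
    return cpu.tlb[vpage]
```

In this model, once shootdown returns, no victim holds the old translation for the affected page, so the next call to translate on any processor observes the new frame.&lt;br /&gt;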
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Hierarchical TLBs ===&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Instruction-based Invalidation ===&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Address Space Identifier (ASID) based Approach ===&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to its ASID on each processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, which effectively invalidates them. The IRIX 5.2 operating system uses this mechanism on MIPS; a related use of ASIDs is described for Xen on AMD64 (AMD-V) hardware. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
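&lt;br /&gt;
The ASID bookkeeping described above can be sketched as follows. This is an illustrative model rather than IRIX or MIPS code: the AsidManager class and its method names are hypothetical, TLB entries are assumed to be keyed by an (ASID, virtual page) pair, and ASID exhaustion is ignored.&lt;br /&gt;

```python
class AsidManager:
    """Tracks the ASID of each process on each processor (hypothetical model)."""

    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        self.asid_of = {}                # (pid, cpu) to ASID; 0 means "stale"
        self.next_asid = [1] * num_cpus  # next unused ASID per processor

    def asid_for(self, pid, cpu):
        """Return the current ASID, allocating a new one if it is zero."""
        if self.asid_of.get((pid, cpu), 0) == 0:
            self.asid_of[(pid, cpu)] = self.next_asid[cpu]
            self.next_asid[cpu] += 1
        return self.asid_of[(pid, cpu)]

    def pte_modified(self, pid, modifying_cpu):
        """Zero the ASIDs of the process on every other processor."""
        for cpu in range(self.num_cpus):
            if cpu != modifying_cpu:
                self.asid_of[(pid, cpu)] = 0
```

Because a TLB lookup in this model matches on the ASID as well as the virtual page, entries tagged with the old ASID can no longer hit once the process has been given a new one, without any TLB entry being touched directly.&lt;br /&gt;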
&lt;br /&gt;
=== Validation Based Approach ===&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
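&lt;br /&gt;
A minimal sketch of the validation idea is given below. The Memory and Tlb classes are hypothetical, generation counts are assumed to be kept per frame and checked by the memory on every access, and only mapping changes (not permission changes) are modeled.&lt;br /&gt;

```python
class Memory:
    """Main memory that validates a generation count on each access."""

    def __init__(self):
        self.page_table = {}  # virtual page to frame
        self.generation = {}  # frame to generation count

    def _bump(self, frame):
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def map(self, vpage, frame):
        """Map vpage to frame, bumping the count of every frame affected."""
        if vpage in self.page_table:
            self._bump(self.page_table[vpage])  # old frame loses this page
        self.page_table[vpage] = frame
        self._bump(frame)

    def access_allowed(self, frame, generation):
        return self.generation.get(frame) == generation


class Tlb:
    def __init__(self, memory):
        self.memory = memory
        self.entries = {}  # virtual page to (frame, generation at fill time)
        self.refills = 0

    def load(self, vpage):
        """Return the frame for vpage, refilling the entry if it is stale."""
        while True:
            if vpage not in self.entries:
                frame = self.memory.page_table[vpage]
                self.entries[vpage] = (frame, self.memory.generation[frame])
                self.refills += 1
            frame, generation = self.entries[vpage]
            if self.memory.access_allowed(frame, generation):
                return frame
            del self.entries[vpage]  # request denied: drop the stale entry
```

In this sketch no interrupt or cross-processor synchronization is needed: a stale entry is only discovered, and repaired, when the processor actually uses it.&lt;br /&gt;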
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin, and Bracy.&amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt; UNITD provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. It improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. A hardware-based TLB coherence protocol can a) improve TLB coherence performance significantly for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen in Figure 3, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The effective latency is higher than shown in the graphs because TLB invalidations cost extra cycles to repopulate the TLB, either from the cache or from main memory depending on the case. The overhead from TLB shootdowns is higher for applications that use the routine more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
=== UNITD Coherence Implementation ===&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol with Modified, Owned, Shared, and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual-to-physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are therefore only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence so that a cached translation is associated with the memory block containing the PTE, rather than with the PTE itself. Maintaining translation coherence at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. The physical addresses of these PTE blocks are held in the PCAM (PTE CAM), a content-addressable memory placed alongside the TLB. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same datum. For simplicity, PCAM entries are referred to simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
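&lt;br /&gt;
The two PCAM operations described above (insertion alongside a TLB fill, and invalidation on a coherence request) can be sketched in software. This is only an illustration of the bookkeeping, not the hardware design from the paper: the UnitdTlb class and its method names are hypothetical, and the parallel CAM search is modeled as a linear scan.&lt;br /&gt;

```python
class UnitdTlb:
    """TLB with a same-sized PCAM holding PTE block addresses (illustrative)."""

    def __init__(self, num_entries):
        self.translations = [None] * num_entries  # (virtual page, frame)
        self.valid = [False] * num_entries        # Shared when True, else Invalid
        self.pcam = [None] * num_entries          # PTE block address per index

    def insert(self, index, vpage, frame, pte_block_addr):
        """Fill a TLB entry and its PCAM entry at the same index."""
        self.translations[index] = (vpage, frame)
        self.valid[index] = True
        self.pcam[index] = pte_block_addr

    def coherence_invalidate(self, block_addr):
        """Clear the Valid bit of every entry whose PTE lives in block_addr."""
        hits = [i for i, addr in enumerate(self.pcam) if addr == block_addr]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits

    def lookup(self, vpage):
        """Regular TLB lookup; the PCAM is not consulted on this path."""
        for i, entry in enumerate(self.translations):
            if self.valid[i] and entry[0] == vpage:
                return entry[1]
        return None  # TLB miss: reacquire the translation from memory
```

With two translations whose PTEs share block address 12, a single invalidation of address 12 hits both PCAM entries, mirroring the situation in Figure 2(b), while translations backed by other blocks are unaffected.&lt;br /&gt;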
&lt;br /&gt;
=== Performance ===&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
The results show three things. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graph below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol10, Issue-2, 1990, Pages 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following approaches provides a solution for TLB coherence (choose at least one)-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
6. What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
a) TLB active flag&lt;br /&gt;
b) Hardware instructions&lt;br /&gt;
c) Inter-Processor Interrupts&lt;br /&gt;
d) PMT entries in main memory&lt;br /&gt;
&lt;br /&gt;
7. Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
a) Increase in protection level&lt;br /&gt;
b) Changes to virtual-physical address mappings.&lt;br /&gt;
c) Page table reference counts&lt;br /&gt;
d) Presence bit&lt;br /&gt;
&lt;br /&gt;
8. An ASID refers to: (choose one answer)&lt;br /&gt;
a) The unique ID of a process for its lifetime&lt;br /&gt;
b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
c) The unique ID of a PMT entry&lt;br /&gt;
d) The unique ID of a page&lt;br /&gt;
&lt;br /&gt;
9. If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
b) Between PMT entries in main memory and processor registers&lt;br /&gt;
c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
d) Between PMT entries in main memory and the data cache&lt;br /&gt;
&lt;br /&gt;
10. How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
a) Via a shared memory semaphore&lt;br /&gt;
b) Via the interrupt IDC&lt;br /&gt;
c) Via a hardware instruction&lt;br /&gt;
d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:TLBFigure811.png&amp;diff=60089</id>
		<title>File:TLBFigure811.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:TLBFigure811.png&amp;diff=60089"/>
		<updated>2012-03-20T00:22:02Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60084</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=60084"/>
		<updated>2012-03-20T00:04:31Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Quiz */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;p&amp;gt;&amp;lt;font size=&amp;quot;3&amp;quot;&amp;gt;&lt;br /&gt;
TLB (Translation Lookaside Buffer) Coherence in Multiprocessing&lt;br /&gt;
&amp;lt;/font&amp;gt;&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
The cache coherence problem in multiprocessing architectures and its hardware-based protocol solutions have received great attention. Another coherence problem in multiprocessing is that of TLBs (Translation Lookaside Buffers). A TLB is a fully associative hardware cache that maintains the virtual-to-physical mappings of the most recently used pages. The CPU uses the TLB for fast lookup of the physical page frame number for a virtual memory location. The same lookup from the page table for the process (a software-based construct) is slower. A page may become invalid because it is swapped out of memory, or it may change protection level (read-only versus read-write). Maintaining coherence between TLB entries on different processors for the same page (as its state changes) is called the TLB coherence problem. This coherence is most often maintained by software-based solutions in the operating system. In this note, we explain a) the background for the use of TLB hardware, b) current solutions for maintaining TLB coherence, and c) a proposal for a unified cache-TLB coherence protocol (that is hardware based).&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to the page table is stored in a register (the '''Page-Table Base Register, PTBR'''). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page table whose entries are 4 bytes (32 bits) long can point to 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), a system with 4-byte entries can address 2^32 x 2^12 = 2^44 bytes (16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access through the PTBR) to find the frame number, and only then access the required location in that frame. Every access therefore needs two memory references, so memory access is slowed down by a factor of 2.&lt;br /&gt;
&lt;br /&gt;
[[File:TLB.png|400px|thumb|right| Virtual to physical page mapping using TLB and Page table]]&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
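&lt;br /&gt;
The lookup path described above (split the address, try the TLB, fall back to the page table on a miss, then cache the result) can be sketched as follows. This is a simplified model under stated assumptions: the SimpleTlb class is hypothetical, pages are assumed to be 4 KB, the page table is a plain dictionary, and LRU is chosen as the replacement policy only for illustration.&lt;br /&gt;

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KB pages


class SimpleTlb:
    """Small TLB in front of a page table, with LRU replacement (illustrative)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table  # page number to frame number
        self.entries = OrderedDict()  # cached subset, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        """Return the physical address for the virtual address vaddr."""
        page, offset = divmod(vaddr, PAGE_SIZE)
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)        # mark as most recently used
        else:
            self.misses += 1
            if len(self.entries) == self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[page] = self.page_table[page]  # extra memory reference
        return self.entries[page] * PAGE_SIZE + offset
```

Only the miss path touches the page table, which is why a high TLB hit rate removes most of the factor-of-2 slowdown of paging.&lt;br /&gt;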
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the page is read-only or read-write ('''protection level'''). Another bit, '''valid-invalid''', indicates whether the page belongs to the address space of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; design that enables simultaneous lookup of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB lookup is used to fetch it. Performing both lookups simultaneously makes the process efficient.&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes, and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection-level changes for a mapping must be observed by all processors. Finally, a mapping modification (where the physical mapping changes for a virtual address) must be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover the following approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
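&lt;br /&gt;
The lazy handling described above can be sketched in a few lines of Python. This is only an illustrative model, not kernel code: the names page_table, tlb and access are hypothetical, and access rights are modeled as sets of permitted operations.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical model: rights are sets of allowed operations per virtual page.
page_table = {7: {"read"}}          # authoritative PMT entry for page 7
tlb = {7: {"read"}}                 # this processor's cached copy


def access(page, op):
    """Return True if the access succeeds, False on a genuine protection fault."""
    if op in tlb.get(page, set()):
        return True                 # TLB copy already allows the operation
    # Fault path: re-check the page table before reporting a violation.
    if op in page_table.get(page, set()):
        tlb[page] = set(page_table[page])   # refresh the stale entry, then retry
        return True
    return False


page_table[7] = {"read", "write"}   # another processor relaxes protection
write_ok = access(7, "write")       # stale TLB faults, kernel refreshes and retries
exec_ok = access(7, "execute")      # not permitted by the page table either
```
&lt;br /&gt;
The write succeeds only after the fault path consults the page table, which is why relaxing protection is a safe change: the stale entry can never grant more access than the page table allows.&lt;br /&gt;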
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
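&lt;br /&gt;
The six steps above can be sketched as a small sequential Python model. It is a simplification under stated assumptions: the inter-processor interrupt is modeled as a direct method call, the wait state is implicit because the model is single-threaded, and the names Processor, handle_shootdown_ipi, resume and shootdown are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import threading


class Processor:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}               # virtual page to frame
        self.tlb_active = True      # the active flag from the steps above
        self.interrupts_enabled = True

    def handle_shootdown_ipi(self, pages):
        """Step 4: invalidate the named entries and stop using the TLB."""
        self.interrupts_enabled = False
        for page in pages:
            self.tlb.pop(page, None)
        self.tlb_active = False

    def resume(self):
        """Step 6: leave the wait state and start using the TLB again."""
        self.tlb_active = True
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, lock, updates):
    initiator.interrupts_enabled = False            # step 1
    initiator.tlb_active = False
    with lock:                                      # step 2: lock the page table
        for cpu in victims:                         # step 3: IPI modeled as a call
            cpu.handle_shootdown_ipi(list(updates))
        page_table.update(updates)                  # step 5: change the PMT
        initiator.tlb.update(updates)               # and the initiator's own TLB
        initiator.tlb_active = True
    initiator.interrupts_enabled = True
    for cpu in victims:                             # step 6
        cpu.resume()


page_table = {1: 10, 2: 20}
cpus = [Processor(i) for i in range(3)]
for cpu in cpus:
    cpu.tlb = dict(page_table)
shootdown(cpus[0], cpus[1:], page_table, threading.Lock(), {2: 99})
```
&lt;br /&gt;
After the call, the victims no longer hold an entry for page 2 and will reload it from the page table on their next miss, while their entry for the untouched page 1 survives.&lt;br /&gt;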
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process's ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so the stale entries can no longer be hit and are effectively invalidated. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used for TLB management by Xen on AMD64 (AMD-V) processors. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
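&lt;br /&gt;
A minimal Python sketch of the ASID mechanism follows. It assumes an unbounded ASID counter (real hardware has a small ASID field that wraps around), and the names Cpu, asid_for, lookup and pte_changed are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import itertools


class Cpu:
    def __init__(self):
        self.tlb = {}                       # (asid, virtual page) to frame
        self.asid_of = {}                   # per-CPU array: pid to ASID, 0 is stale
        self._next_asid = itertools.count(1)

    def asid_for(self, pid):
        """Return the ASID for pid, assigning a fresh one if it is missing or 0."""
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = next(self._next_asid)
        return self.asid_of[pid]

    def lookup(self, pid, page, page_table):
        key = (self.asid_for(pid), page)
        if key not in self.tlb:             # miss: reload from the page table
            self.tlb[key] = page_table[page]
        return self.tlb[key]


def pte_changed(pid, other_cpus):
    """Kernel side: zero the ASID of pid on every other processor."""
    for cpu in other_cpus:
        cpu.asid_of[pid] = 0


page_table = {5: 50}
cpu0, cpu1 = Cpu(), Cpu()
before = cpu1.lookup(42, 5, page_table)     # cached under ASID 1
page_table[5] = 77                          # the PTE is modified elsewhere
pte_changed(42, [cpu1])
after = cpu1.lookup(42, 5, page_table)      # fresh ASID misses and reloads
```
&lt;br /&gt;
The stale entry tagged with the old ASID is still physically present in the TLB, but no lookup can match it any more, so no interrupt is needed to remove it.&lt;br /&gt;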
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
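&lt;br /&gt;
The validation-based approach can be illustrated with the following Python sketch. It models only mapping changes (not permission changes), keeps one generation count per frame, and uses the hypothetical names MemoryManager, Tlb, remap and translate.&lt;br /&gt;
&lt;br /&gt;
```python
class MemoryManager:
    def __init__(self):
        self.page_table = {}    # virtual page to frame
        self.generation = {}    # frame to generation count

    def remap(self, page, new_frame):
        old_frame = self.page_table.get(page)
        if old_frame is not None:
            self.generation[old_frame] += 1   # makes cached copies stale
        self.page_table[page] = new_frame
        self.generation.setdefault(new_frame, 0)

    def access(self, frame, generation):
        """Grant the request only if the caller's generation is current."""
        return generation == self.generation[frame]


class Tlb:
    def __init__(self, manager):
        self.manager = manager
        self.entries = {}       # virtual page to (frame, generation at fill time)
        self.refills = 0

    def _fill(self, page):
        frame = self.manager.page_table[page]
        self.entries[page] = (frame, self.manager.generation[frame])
        self.refills += 1

    def translate(self, page):
        if page not in self.entries:
            self._fill(page)
        if not self.manager.access(*self.entries[page]):
            self._fill(page)    # request denied: refresh the entry and retry
        return self.entries[page][0]


manager = MemoryManager()
manager.remap(3, 30)
tlb = Tlb(manager)
first = tlb.translate(3)        # fills the entry for page 3
manager.remap(3, 31)            # frame 30 is reassigned; its count is bumped
second = tlb.translate(3)       # stale generation is rejected, entry refilled
```
&lt;br /&gt;
No interrupt or lock is involved: the stale entry is detected only at the moment it is used, which is the trade-off this approach makes.&lt;br /&gt;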
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy. &amp;lt;ref&amp;gt;[http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&amp;lt;/ref&amp;gt; It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without needing any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure1.png|400px|thumb|right|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|400px|thumb|right|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from Figure 3, which shows the latency for the processor issuing the shootdown and for the ones receiving it. The effective latency is higher than shown in the graphs because TLB invalidations result in extra cycles spent repopulating the TLB, either from the cache or from main memory depending on the case. The overhead from TLB shootdowns is higher for applications that use the routine more often (refer to Figure 4).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence Implementation'''&lt;br /&gt;
&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (for example, a typical MOSI protocol with Modified, Owned, Shared and Invalid states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual-to-physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are therefore only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence between a translation and its PTE: each cached translation is associated with the memory block containing the PTE rather than with the PTE itself. These associations are tracked in the PCAM (PTE content-addressable memory), which holds the physical address of the PTE block for every translation cached in the TLB. Maintaining translation coherence at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, PCAM entries are referred to simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
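&lt;br /&gt;
The PCAM insert and invalidate operations described above can be sketched in Python as follows. This is a software model of a hardware structure, not the paper's implementation: the 64-byte block size, the example PTE addresses and the names UnitdTlb, insert, lookup and coherence_invalidate are assumptions chosen so that two PTEs fall into block 12, mirroring Figure 2.&lt;br /&gt;
&lt;br /&gt;
```python
class UnitdTlb:
    """TLB slots paired index-for-index with a PCAM of PTE block addresses."""

    def __init__(self, slots, block_size=64):
        self.block_size = block_size
        self.tlb = [None] * slots       # (virtual page, frame) per slot
        self.valid = [False] * slots    # Valid bit: True is Shared, False is Invalid
        self.pcam = [None] * slots      # block address of the PTE behind each slot

    def insert(self, slot, page, frame, pte_address):
        self.tlb[slot] = (page, frame)
        self.valid[slot] = True
        self.pcam[slot] = pte_address // self.block_size

    def lookup(self, page):
        for slot, entry in enumerate(self.tlb):
            if self.valid[slot] and entry[0] == page:
                return entry[1]
        return None                     # miss: translation must be reacquired

    def coherence_invalidate(self, address):
        """Clear the Valid bit of every slot whose PTE lives in the written block."""
        block = address // self.block_size
        hits = [s for s, b in enumerate(self.pcam) if self.valid[s] and b == block]
        for slot in hits:
            self.valid[slot] = False
            self.pcam[slot] = None
        return hits


tlb = UnitdTlb(slots=4)
tlb.insert(0, page=1, frame=100, pte_address=768)    # PTE in block 12
tlb.insert(1, page=2, frame=200, pte_address=776)    # PTE in block 12 as well
tlb.insert(2, page=9, frame=900, pte_address=1024)   # PTE in block 16
hits = tlb.coherence_invalidate(770)                 # a write to block 12
```
&lt;br /&gt;
A single invalidation for block 12 clears both translations whose PTEs share that block, which is the coarse-grain behavior the text describes, while the translation backed by block 16 stays valid.&lt;br /&gt;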
&lt;br /&gt;
'''Performance'''&lt;br /&gt;
&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
Three results stand out. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graph below).&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
4. Microprocessor Memory Management Unit, Milan Milenkovic, IEEE, Vol10, Issue-2, 1990, Pages 70-85&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following approaches provides a solution for TLB coherence (choose at least one)-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
6. What messaging implementation is typically used with TLB Shootdown: (choose one answer)&lt;br /&gt;
a) TLB active flag&lt;br /&gt;
b) Hardware instructions&lt;br /&gt;
c) Inter-Processor Interrupts&lt;br /&gt;
d) PMT entries in main memory&lt;br /&gt;
&lt;br /&gt;
7. Which data elements of a PMT entry are unsafe without a TLB coherence scheme: (choose all that apply)&lt;br /&gt;
a) Increase in protection level; &lt;br /&gt;
b) Changes to virtual-physical address mappings.&lt;br /&gt;
c) Page table reference counts&lt;br /&gt;
d) Presence bit&lt;br /&gt;
&lt;br /&gt;
8. An ASID refers to: (choose one answer)&lt;br /&gt;
a) The unique ID of a process for its lifetime&lt;br /&gt;
b) The unique ID of a process-processor pair at a point in time&lt;br /&gt;
c) The unique ID of a PMT entry&lt;br /&gt;
d) The unique ID of a page&lt;br /&gt;
&lt;br /&gt;
9. If Virtually Indexed data caches avoid the use of TLBs, where is the coherence issue identified? (choose one answer)&lt;br /&gt;
a) Between PMT entries in main memory and the disk storage&lt;br /&gt;
b) Between PMT entries in main memory and processor registers&lt;br /&gt;
c) Between PMT entries in main memory and the instruction cache&lt;br /&gt;
d) Between PMT entries in main memory and the data cache&lt;br /&gt;
&lt;br /&gt;
10. How does the initiating processor prevent updates to the PMT entries by other processors during a PMT update? (Choose one answer)&lt;br /&gt;
a) Via a shared memory semaphore&lt;br /&gt;
b) Via the interrupt IDC&lt;br /&gt;
c) Via a hardware instruction&lt;br /&gt;
d) Via a lock to the page table in main memory&lt;br /&gt;
&lt;br /&gt;
Answers&lt;br /&gt;
1-a, 2-a, 3-a, 4-c, 5-a,b,c,d, 6-c, 7-a,b, 8-b, 9-d, 10-d&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59942</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59942"/>
		<updated>2012-03-19T04:15:10Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Unified Cache and TLB coherence solution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (one single space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For a process, a pointer to the page table is stored in a register, the '''Page-Table Base Register''' (PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes long (32 bits) can point to one of 2^32 physical page frames. If the frame size is 4 KB, then a system with 4-byte entries can address 2^32 x 2^12 = 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
The hardware support for paging works as follows. Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the address in the PTBR), then read the frame number from the page table and only thereafter access the required location in that frame. Memory access is thus slowed down by a factor of 2.&lt;br /&gt;
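&lt;br /&gt;
As a worked example of the page number and offset split, the following Python sketch translates a virtual address through a single-level page table. The 4 KB page size and the page-table contents are assumptions made for illustration, and translate is a hypothetical helper.&lt;br /&gt;
&lt;br /&gt;
```python
PAGE_SIZE = 4096                       # assumed 4 KB pages

page_table = {0: 5, 1: 9, 2: 3}        # hypothetical: page number to frame number


def translate(virtual_address):
    """Split the address into (p, d) and rebuild the physical address."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    frame_number = page_table[page_number]     # the extra memory access
    return frame_number * PAGE_SIZE + offset


physical = translate(1 * PAGE_SIZE + 100)      # page 1, offset 100
```
&lt;br /&gt;
Page 1 maps to frame 9, so the result is 9 x 4096 + 100; the offset passes through unchanged because pages and frames have the same size.&lt;br /&gt;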
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
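&lt;br /&gt;
The hit, miss and replacement behavior described above can be sketched as a small Python model with LRU replacement. The capacity of two entries and the page-table contents are assumptions for illustration; Tlb and frame_for are hypothetical names.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class Tlb:
    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table
        self.entries = OrderedDict()   # page number to frame, oldest first
        self.hits = 0
        self.misses = 0

    def frame_for(self, page):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)         # mark as most recently used
            return self.entries[page]
        self.misses += 1
        frame = self.page_table[page]              # slow path: consult the page table
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)       # evict the least recently used
        self.entries[page] = frame
        return frame


tlb = Tlb(capacity=2, page_table={0: 5, 1: 9, 2: 3})
frames = [tlb.frame_for(p) for p in (0, 1, 0, 2, 1)]
```
&lt;br /&gt;
In this reference sequence only the second access to page 0 hits; loading page 2 evicts page 1, so the final access to page 1 misses again.&lt;br /&gt;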
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the page is read-only or read-write (the '''protection level'''). Another bit, the '''valid-invalid''' bit, indicates whether the page belongs to the address space of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that enables simultaneous look-up of both the L1 cache and the TLB. The cache set is selected using the virtual address while the TLB translates the page number; the resulting physical address is then compared against the cache tags, and on an L1 miss it is used to fetch the block. Performing both look-ups simultaneously makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another). Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table). If a TLB holds a mapping for a page whose Page Table Entry (PTE) has been invalidated, that mapping must be flushed from all TLBs and must not be used. Further, protection level changes for a TLB mapping need to be observed by all processors. Finally, a mapping modification for a TLB entry (where the physical mapping changes for a virtual address) needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
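&lt;br /&gt;
The six steps above can be traced with a small sequential simulation. This is a sketch under strong assumptions: there are no real interrupts, locks or concurrency, the inter-processor interrupt is modelled as a direct method call, and the names `Processor`, `handle_shootdown_ipi` and `shootdown` are invented for the example.&lt;br /&gt;
&lt;br /&gt;
```python
class Processor:
    # Hypothetical model of one CPU with a private TLB.
    def __init__(self, name):
        self.name = name
        self.tlb = {}           # virtual page to frame
        self.tlb_active = True  # is this processor using its TLB?

    def handle_shootdown_ipi(self, pages):
        # Step 4: invalidate the affected entries and stop using the TLB.
        for page in pages:
            self.tlb.pop(page, None)
        self.tlb_active = False


def shootdown(initiator, victims, page_table, updates, log):
    # Steps 1-2: the initiator stops using its TLB and locks the page table.
    initiator.tlb_active = False
    log.append('lock page table')
    # Steps 3-4: interrupt every victim and wait for all of them to respond.
    for cpu in victims:
        cpu.handle_shootdown_ipi(list(updates))
        log.append('ipi ' + cpu.name)
    # Step 5: change the page table, refresh the initiator's TLB, unlock.
    page_table.update(updates)
    for page, frame in updates.items():
        if page in initiator.tlb:
            initiator.tlb[page] = frame
    initiator.tlb_active = True
    log.append('unlock page table')
    # Step 6: victims leave their wait state and resume using their TLBs.
    for cpu in victims:
        cpu.tlb_active = True


page_table = {7: 40, 8: 41}
cpus = [Processor('P0'), Processor('P1'), Processor('P2')]
for cpu in cpus:
    cpu.tlb = dict(page_table)
log = []
shootdown(cpus[0], cpus[1:], page_table, {7: 99}, log)
```
&lt;br /&gt;
After the call, P1 and P2 no longer hold a translation for page 7 and will reload it from the updated page table on their next access, while the unaffected entry for page 8 survives in every TLB.&lt;br /&gt;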
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that it is not guaranteed to stay the same for the life of a process. Each process is assigned its own ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to its per-processor ASIDs.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so the stale entries are never used and the translations are reloaded from the page table. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used on AMD64 processors, for example by the Xen hypervisor with AMD-V. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
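&lt;br /&gt;
The ASID bookkeeping described above can be sketched as follows. The model is an assumption-laden illustration, not the IRIX implementation: the names `Cpu`, `run_process`, `pte_modified` and `asid_of` are invented, ASIDs are handed out from an unbounded counter, and the wrap-around of a finite ASID space (which would require flushing the TLB) is not modelled.&lt;br /&gt;
&lt;br /&gt;
```python
class Cpu:
    # Hypothetical CPU whose TLB entries are tagged with an ASID.
    def __init__(self):
        self.next_asid = 1
        self.tlb = {}  # (asid, virtual page) to frame

    def fresh_asid(self):
        asid = self.next_asid
        self.next_asid += 1
        return asid


def run_process(pid, cpu_id, cpus, asid_of):
    # Called when a process is scheduled; ASID 0 means a new ASID is needed.
    key = (pid, cpu_id)
    if asid_of.get(key, 0) == 0:
        asid_of[key] = cpus[cpu_id].fresh_asid()
    return asid_of[key]


def pte_modified(pid, initiator_id, cpus, asid_of):
    # Zero the ASID of the process on every other CPU; TLBs are untouched.
    for cpu_id in range(len(cpus)):
        if cpu_id != initiator_id and (pid, cpu_id) in asid_of:
            asid_of[(pid, cpu_id)] = 0


cpus = [Cpu(), Cpu()]
asid_of = {}
old = run_process(42, 1, cpus, asid_of)   # process 42 runs on CPU 1
cpus[1].tlb[(old, 7)] = 40                # it caches page 7 in frame 40
pte_modified(42, 0, cpus, asid_of)        # CPU 0 changes a PTE of process 42
new = run_process(42, 1, cpus, asid_of)   # process 42 is rescheduled on CPU 1
stale_hit = (new, 7) in cpus[1].tlb       # the old entry no longer matches
```
&lt;br /&gt;
The stale entry is still physically present in the TLB of CPU 1, but because it carries the old ASID it can never match a lookup made under the new one.&lt;br /&gt;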
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
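&lt;br /&gt;
A minimal sketch of the generation-count check is given below. It assumes one counter per frame kept by the memory side and compared on every access; the names `Memory`, `remap` and `load_tlb_entry` are hypothetical and the model ignores protection changes.&lt;br /&gt;
&lt;br /&gt;
```python
class Memory:
    # Hypothetical memory module that validates generation counts.
    def __init__(self):
        self.generation = {}  # frame to current generation count

    def access(self, frame, tlb_generation):
        # The request is honoured only if the count in the TLB is current.
        return self.generation.get(frame, 0) == tlb_generation


def remap(page_table, memory, page, new_frame):
    # An unsafe change: bump the count of the frame being taken away.
    old_frame = page_table[page]
    memory.generation[old_frame] = memory.generation.get(old_frame, 0) + 1
    page_table[page] = new_frame


def load_tlb_entry(tlb, page_table, memory, page):
    # A TLB entry remembers the count that was current when it was created.
    frame = page_table[page]
    tlb[page] = (frame, memory.generation.get(frame, 0))


memory = Memory()
page_table = {3: 10}
tlb = {}
load_tlb_entry(tlb, page_table, memory, 3)
ok_before = memory.access(*tlb[3])
remap(page_table, memory, 3, 11)
ok_stale = memory.access(*tlb[3])   # denied: the entry is out of date
load_tlb_entry(tlb, page_table, memory, 3)
ok_after = memory.access(*tlb[3])
```
&lt;br /&gt;
No interrupt is sent when the mapping changes; the stale entry is detected only when the processor next tries to use it.&lt;br /&gt;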
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu/Lebeck/Sorin/Bracy in their paper. It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into an existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without needing any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach for TLB coherence and significantly reduces the performance penalty due to TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are all-hardware based because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. A hardware-based TLB coherence scheme can a) improve TLB coherence performance significantly for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the ones receiving it. The effective latency is higher than shown in the graphs because TLB invalidations result in extra cycles spent repopulating the TLB, from either the cache or main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often. (Refer to graph-2).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence Implementation'''&lt;br /&gt;
&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual to physical address mappings are never modified in the TLBs themselves but rather in the page table entries). There are only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and thus subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Figure 1: Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
UNITD adds to each TLB a content-addressable memory for PTE addresses (the PCAM), which holds the physical addresses of the PTEs whose translations are cached in the TLB. In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence so that each translation is tied to the memory block containing the PTE, rather than to the PTE itself. Maintaining translation coherence at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same datum. For simplicity, PCAM entries are referred to here simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2: Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB.]]&lt;br /&gt;
&lt;br /&gt;
Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2(a) above). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2(a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 2(b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
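&lt;br /&gt;
The two PCAM operations described above (insertion alongside the TLB entry, and invalidation on a coherence request) can be modelled roughly as follows. This is an illustrative sketch, not the hardware design from the paper: the class name `UnitdTlb`, the 64-byte block size and the concrete addresses are assumptions chosen so that two PTEs fall into block 12, mirroring the situation in Figure 2.&lt;br /&gt;
&lt;br /&gt;
```python
BLOCK_SIZE = 64  # assumed cache block size in bytes


class UnitdTlb:
    # Hypothetical TLB with a PCAM that mirrors it slot by slot.
    def __init__(self, size):
        self.tlb = [None] * size      # (virtual page, frame) per slot
        self.valid = [False] * size   # Valid bit: Shared or Invalid
        self.pcam = [None] * size     # block address of the PTE, same index

    def insert(self, index, page, frame, pte_address):
        self.tlb[index] = (page, frame)
        self.valid[index] = True
        self.pcam[index] = pte_address // BLOCK_SIZE

    def lookup(self, page):
        # Regular lookups never touch the PCAM.
        for index, entry in enumerate(self.tlb):
            if self.valid[index] and entry[0] == page:
                return entry[1]
        return None  # TLB miss

    def coherence_invalidate(self, address):
        # Clear the Valid bit of every entry whose PTE lies in the block.
        block = address // BLOCK_SIZE
        hits = [i for i, b in enumerate(self.pcam)
                if b == block and self.valid[i]]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits


unit = UnitdTlb(4)
unit.insert(0, page=1, frame=20, pte_address=768)  # PTE in block 12
unit.insert(1, page=5, frame=21, pte_address=896)  # PTE in block 14
unit.insert(2, page=2, frame=22, pte_address=776)  # PTE in block 12 again
hit_slots = unit.coherence_invalidate(768)
```
&lt;br /&gt;
A single invalidation for block 12 hits two PCAM slots, so both corresponding translations become Invalid, while the translation whose PTE lives in block 14 is unaffected. This is the small over-invalidation cost of tracking coherence per cache block rather than per PTE.&lt;br /&gt;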
&lt;br /&gt;
'''Performance'''&lt;br /&gt;
&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB; the validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
Three results stand out. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graphs below).&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure1.png|frame|none|alt=alt text|Figure 3: Per-shootdown Latency]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png|frame|none|alt=alt text|Figure 4: Shootdown performance overhead on Phoenix benchmarks]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png|frame|none|alt=alt text|Figure 5: single_unmap benchmark. UNITD speedup over baseline system for directory.]]&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture: A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59937</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59937"/>
		<updated>2012-03-19T04:11:17Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Unified Cache and TLB coherence solution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For a process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (For a page table whose entries are 4 bytes (32 bits) long, each entry can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory).&lt;br /&gt;
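&lt;br /&gt;
The arithmetic in the parenthetical above can be checked with a few lines of Python. The sketch assumes that all 32 bits of an entry hold the frame number; real page-table entries reserve some bits for protection and status flags, so the practical limit is lower. The variable names are chosen only for this example.&lt;br /&gt;
&lt;br /&gt;
```python
ENTRY_BITS = 32          # a 4-byte page-table entry
FRAME_SIZE = 4 * 1024    # 4 KB frames, i.e. 2**12 bytes

frames = 2 ** ENTRY_BITS                 # distinct frame numbers
addressable_bytes = frames * FRAME_SIZE  # 2**32 * 2**12 = 2**44
addressable_tb = addressable_bytes // 2 ** 40
```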
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access at the address held in the PTBR, offset by the page number) to find the frame number, and only then access the desired location within that frame. Two memory accesses are therefore needed for every reference, so memory access is slowed down by a factor of 2. &lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
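&lt;br /&gt;
The hit, miss and replacement behaviour described above can be sketched as a small simulation. This is an illustrative model only: the class name TLB, its capacity parameter and the choice of LRU (one of the policies mentioned) are assumptions, and a real TLB performs the lookup in parallel hardware.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict

class TLB:
    """Toy fully associative TLB with LRU replacement (capacity is an assumption)."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()  # page number to frame number, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number, page_table):
        if page_number in self.entries:
            self.hits += 1
            self.entries.move_to_end(page_number)  # mark as most recently used
            return self.entries[page_number]
        self.misses += 1
        frame_number = page_table[page_number]     # TLB miss: consult the page table
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)       # evict the least recently used entry
        self.entries[page_number] = frame_number
        return frame_number

tlb = TLB(capacity=2)
page_table = {0: 5, 1: 9, 2: 3}
for page in [0, 1, 0, 2, 1]:
    tlb.lookup(page, page_table)
print(tlb.hits, tlb.misses)  # 1 4
```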
&lt;br /&gt;
A page table entry typically has an additional bit to indicate whether the page is read-only or read-write (the '''protection level'''). Another bit, the '''valid-invalid''' bit, indicates whether the page is part of the address space of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. An entry in the first-level page table points to a next-level page table, and so on, until the mapping of the virtual page to a physical frame is obtained from the last-level page table.&lt;br /&gt;
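&lt;br /&gt;
A two-level walk can be sketched as below. The bit widths (12 offset bits and 10 index bits per level) and the nested-dict representation are assumptions made for illustration, not a description of any particular architecture.&lt;br /&gt;
&lt;br /&gt;
```python
OFFSET_BITS = 12   # assumed 4 KB pages
INDEX_BITS = 10    # assumed 10-bit index per level (32-bit virtual address)

def walk(virtual_address, top_level):
    """Walk a hypothetical two-level page table stored as nested dicts."""
    page_number, offset = divmod(virtual_address, 2 ** OFFSET_BITS)
    first_index, second_index = divmod(page_number, 2 ** INDEX_BITS)
    second_level = top_level[first_index]      # first-level entry points to a second-level table
    frame_number = second_level[second_index]  # last-level entry holds the frame number
    return frame_number * 2 ** OFFSET_BITS + offset

top_level = {1: {2: 7}}
virtual_address = (1 * 2 ** INDEX_BITS + 2) * 2 ** OFFSET_BITS + 0x34
print(hex(walk(virtual_address, top_level)))   # 0x7034
```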
&lt;br /&gt;
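It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; (VIPT) organization that enables simultaneous look-up of both the L1 cache and the TLB. The cache set is selected using bits of the virtual address while the TLB translates the page number in parallel; the physical address produced by the TLB is then compared against the cache tags to decide whether the access hits. Simultaneous look-up of the L1 cache and the TLB keeps address translation off the critical path of a cache hit.[[Solihin]]&lt;br /&gt;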
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be cached simultaneously in multiple TLBs. (This happens because data is shared across processors and because processes migrate from one processor to another.) Since the contents of these mappings can change in response to changes of state (of the mapping and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events:&lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read-only to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out makes the corresponding virtual-to-physical page mapping invalid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), that mapping must be flushed from all TLBs and must not be used. Further, a protection-level change for a mapping must be observed by all processors. Likewise, a mapping modification (where the physical page mapped to a virtual address changes) must be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for every datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
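&lt;br /&gt;
The safe/unsafe distinction can be expressed as a small predicate. This is a sketch under simplifying assumptions: the Translation record and the function is_unsafe_change are hypothetical names, and accessed/dirty bits (which are safe to change) are not modelled.&lt;br /&gt;
&lt;br /&gt;
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Translation:
    frame: int
    valid: bool
    writable: bool

def is_unsafe_change(old, new):
    """Return True if other TLBs must be made coherent for this change."""
    if old.valid and not new.valid:
        return True    # c) translation marked invalid
    if old.valid and new.valid and old.frame != new.frame:
        return True    # a) mapping modified
    if old.writable and not new.writable:
        return True    # b) privileges decreased
    return False       # e.g. privileges increased: safe, fixed up lazily on a fault

read_only = Translation(frame=3, valid=True, writable=False)
read_write = Translation(frame=3, valid=True, writable=True)
print(is_unsafe_change(read_only, read_write))  # False
print(is_unsafe_change(read_write, read_only))  # True
```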
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB and cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
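&lt;br /&gt;
The six steps above can be sketched as a single-threaded simulation. It is only an illustration of the ordering of events: the names Processor, handle_ipi and shootdown are hypothetical, interrupt masking and real locking are not modelled, and the receiving processors are assumed to reload the translation lazily on their next TLB miss rather than eagerly in step 6.&lt;br /&gt;
&lt;br /&gt;
```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.tlb = {}            # virtual page to frame number
        self.tlb_active = True   # the per-processor TLB active flag

    def handle_ipi(self, pages):
        """Step 4: invalidate the named entries and stop using the TLB."""
        for page in pages:
            self.tlb.pop(page, None)
        self.tlb_active = False

def shootdown(initiator, others, page_table, page, new_frame):
    log = []
    initiator.tlb_active = False        # step 1 (interrupt masking not modelled)
    log.append('lock page table')       # step 2
    for cpu in others:                  # step 3: send IPIs and wait for every response
        cpu.handle_ipi([page])
        log.append('ack from ' + cpu.name)
    page_table[page] = new_frame        # step 5: update the PMT ...
    initiator.tlb[page] = new_frame     # ... and the initiator's own TLB
    initiator.tlb_active = True
    log.append('unlock page table')
    for cpu in others:                  # step 6: responders leave the wait state
        cpu.tlb_active = True
    return log

page_table = {7: 100}
cpus = [Processor('cpu0'), Processor('cpu1'), Processor('cpu2')]
for cpu in cpus:
    cpu.tlb[7] = 100
log = shootdown(cpus[0], cpus[1:], page_table, page=7, new_frame=200)
print(log)
print([cpu.tlb.get(7) for cpu in cpus])  # [200, None, None]
```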
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID), except that a process is not guaranteed to keep the same ASID for its whole life. Each process is assigned a unique ASID on each processor; hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so those entries can no longer be used and the translations are reloaded on the resulting TLB misses. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used by the Xen hypervisor on AMD64 (AMD-V) processors. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
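&lt;br /&gt;
The ASID mechanism can be sketched as follows. This is a toy model, not IRIX or MIPS code: the names Cpu, asid_for and pte_changed are hypothetical, the handling of the local processor's TLB is an assumption, and ASID exhaustion (which on real hardware requires recycling ASIDs and flushing the TLB) is not modelled.&lt;br /&gt;
&lt;br /&gt;
```python
class Cpu:
    def __init__(self):
        self.next_asid = 1
        self.asid_of = {}   # the separate array: process id to ASID on this CPU (0 = stale)
        self.tlb = {}       # (asid, virtual page) to frame number

    def asid_for(self, pid):
        """Return the ASID for pid, allocating a fresh one if none is set or it was zeroed."""
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = self.next_asid
            self.next_asid += 1
        return self.asid_of[pid]

    def lookup(self, pid, page, page_table):
        key = (self.asid_for(pid), page)
        if key not in self.tlb:
            self.tlb[key] = page_table[page]   # TLB miss: reload from the page table
        return self.tlb[key]

def pte_changed(pid, page, new_frame, page_table, local_cpu, other_cpus):
    page_table[page] = new_frame
    # Assumption: the local processor simply drops its own entry.
    local_cpu.tlb.pop((local_cpu.asid_for(pid), page), None)
    for cpu in other_cpus:
        if pid in cpu.asid_of:
            cpu.asid_of[pid] = 0   # TLB entries untouched; only the mapping array changes

page_table = {4: 10}
cpu0, cpu1 = Cpu(), Cpu()
print(cpu1.lookup(42, 4, page_table))   # 10
pte_changed(42, 4, 11, page_table, cpu0, [cpu1])
print(cpu1.lookup(42, 4, page_table))   # 11
```

Note that the stale entry tagged with the old ASID is still present in the second processor's TLB after the update; it is simply never matched again.&lt;br /&gt;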
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
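&lt;br /&gt;
The generation-count check can be sketched as below. The class Memory and the helper load_tlb_entry are hypothetical names for illustration; in particular, bumping the count of both the old and the new frame on a remap is an assumption about how the memory manager would apply the rule described above.&lt;br /&gt;
&lt;br /&gt;
```python
class Memory:
    def __init__(self):
        self.generation = {}   # frame number to current generation count
        self.page_table = {}   # virtual page to frame number

    def map(self, page, frame):
        """Change a mapping and bump the generation of every frame involved."""
        old_frame = self.page_table.get(page)
        if old_frame is not None:
            self.generation[old_frame] = self.generation.get(old_frame, 0) + 1
        self.generation[frame] = self.generation.get(frame, 0) + 1
        self.page_table[page] = frame

    def access(self, frame, generation):
        """Accept the request only if the caller's generation count is current."""
        return self.generation.get(frame) == generation

def load_tlb_entry(memory, page):
    frame = memory.page_table[page]
    return frame, memory.generation[frame]

memory = Memory()
memory.map(page=1, frame=8)
entry = load_tlb_entry(memory, 1)   # the TLB entry caches (frame, generation)
print(memory.access(*entry))        # True
memory.map(page=1, frame=9)         # remap without interrupting any processor
print(memory.access(*entry))        # False: the stale entry is rejected
entry = load_tlb_entry(memory, 1)   # the processor refreshes its TLB entry
print(memory.access(*entry))        # True
```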
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol, proposed by Romanescu, Lebeck, Sorin and Bracy in their paper, are discussed. It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach and significantly reduces the performance penalty due to TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) significantly improve TLB coherence performance for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency for the processor issuing the shootdown and for the ones receiving it. The effective latency is higher than shown in the graphs because TLB invalidations cost extra cycles to repopulate the TLB, from either the cache or main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often. (Refer to graph-2.)&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence Implementation'''&lt;br /&gt;
&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual-to-physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are therefore only two coherence states: Shared (read-only) and Invalid. UNITD uses the Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
To let each TLB recognize coherence messages that concern its cached translations, UNITD pairs the TLB with a Page Table Entry CAM (PCAM), a content-addressable memory that holds the physical address of the PTE behind each cached translation. In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence so that each translation is associated with the memory block containing its PTE, rather than with the PTE itself. Maintaining translation coherence at a coarser granularity (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, throughout the rest of this section we refer to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added to the PCAM simultaneously with the insertion of their corresponding translations in the TLB. Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in (a) above). Note that there can be multiple PCAM entries with the same physical address, as in (a); this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure (b) above illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.]]&lt;br /&gt;
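&lt;br /&gt;
The two PCAM operations can be sketched in software as follows. This is a toy model of the behaviour described in the caption, not the hardware design from the paper: the class UnitdTlb and its method names are hypothetical, and a real PCAM compares all entries in parallel rather than scanning a list.&lt;br /&gt;
&lt;br /&gt;
```python
class UnitdTlb:
    """Toy TLB with a parallel PCAM; slot i of each structure belongs together."""

    def __init__(self, size):
        self.translations = [None] * size   # (virtual page, frame) per slot
        self.valid = [False] * size         # Valid bit: True is Shared, False is Invalid
        self.pcam = [None] * size           # physical block address of the PTE per slot

    def insert(self, slot, virtual_page, frame, pte_block_address):
        self.translations[slot] = (virtual_page, frame)
        self.valid[slot] = True
        self.pcam[slot] = pte_block_address  # same index as the translation

    def coherence_invalidate(self, block_address):
        """Clear the Valid bit of every entry whose PTE lives in block_address."""
        hits = [i for i, addr in enumerate(self.pcam) if addr == block_address]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits

tlb = UnitdTlb(size=4)
tlb.insert(0, virtual_page=1, frame=30, pte_block_address=12)
tlb.insert(1, virtual_page=2, frame=31, pte_block_address=12)  # PTE in the same block
tlb.insert(2, virtual_page=9, frame=40, pte_block_address=47)
print(tlb.coherence_invalidate(12))  # [0, 1]
print(tlb.valid)                     # [False, False, True, False]
```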
&lt;br /&gt;
'''Performance'''&lt;br /&gt;
&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. (The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.)&lt;br /&gt;
&lt;br /&gt;
First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graphs below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure1.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png]]&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction/Translation/Data (UNITD) Coherence: One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture: A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59935</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59935"/>
		<updated>2012-03-19T04:08:10Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Unified Cache and TLB coherence solution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, its &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space), which is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, relying on the hardware and the operating system to bring missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct, managed by the operating system, that is kept in main memory. For each process, a pointer to the page table is stored in a register, the '''Page-Table Base Register''' (PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), a system with 4-byte entries can address 2^44 bytes (16 TB) of physical memory.)&lt;br /&gt;
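&lt;br /&gt;
The arithmetic in the parenthetical note above can be checked with a short Python sketch. The constant names are illustrative assumptions and are not taken from any particular system.&lt;br /&gt;
&lt;br /&gt;
```python
# Illustrative constants (assumed): 4-byte page-table entries and 4 KB frames.
ENTRY_BITS = 32            # a 4-byte entry can hold a 32-bit frame number
FRAME_SIZE = 4 * 1024      # 4 KB frame = 2**12 bytes

addressable_frames = 2 ** ENTRY_BITS
addressable_bytes = addressable_frames * FRAME_SIZE

print(addressable_bytes == 2 ** 44)          # True
print(addressable_bytes // 2 ** 40, "TB")    # 16 TB
```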
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the PTBR), read the frame number from the page-table entry, and only then access the required location in that frame. Every access therefore needs two memory accesses, so memory access is slowed down by a factor of 2.&lt;br /&gt;
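&lt;br /&gt;
As a minimal sketch of this translation, assuming a 4 KB page size and a hypothetical single-level table held in a Python dictionary, the split into page number and offset looks as follows. The names PAGE_SIZE, page_table and translate are illustrative only.&lt;br /&gt;
&lt;br /&gt;
```python
PAGE_SIZE = 4096  # assumed page size in bytes (2**12)

# Hypothetical single-level page table: key = page number, value = frame number.
page_table = {0: 5, 1: 9, 2: 3}

def translate(virtual_address):
    """Split the address into (p, d) and combine the frame base with d."""
    p, d = divmod(virtual_address, PAGE_SIZE)   # page number, page offset
    frame = page_table[p]                       # the extra memory access
    return frame * PAGE_SIZE + d                # physical address

# Virtual address 4100 is page 1, offset 4; page 1 maps to frame 9.
print(translate(4100))  # 36868
```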
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
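&lt;br /&gt;
The hit and miss paths described above can be sketched with a toy model. This is only an illustration under stated assumptions: the TLB class, its capacity of two entries and the choice of LRU replacement are hypothetical, and a real TLB performs the comparison in parallel hardware.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict

class TLB:
    """Toy fully associative TLB with LRU replacement (capacity is an assumption)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table      # backing page table: page to frame
        self.entries = OrderedDict()      # page to frame, oldest entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        if page in self.entries:          # TLB hit: frame is available at once
            self.hits += 1
            self.entries.move_to_end(page)
            return self.entries[page]
        self.misses += 1                  # TLB miss: consult the page table
        frame = self.page_table[page]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        self.entries[page] = frame
        return frame

tlb = TLB(capacity=2, page_table={0: 5, 1: 9, 2: 3})
for page in [0, 1, 0, 2, 1]:
    tlb.lookup(page)
print(tlb.hits, tlb.misses)  # 1 4
```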
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the page is read-only or read-write (the '''protection level'''). Another bit, '''valid-invalid''', indicates whether the page is part of that particular process's address space.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. An entry in the first-level page table points to a next-level page table, and so on, until the mapping of the virtual page to a physical page is obtained from the last-level page table.&lt;br /&gt;
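&lt;br /&gt;
A two-level walk can be sketched as below. The table sizes and the nested-dictionary layout are assumptions chosen for illustration; real page-table formats are architecture specific.&lt;br /&gt;
&lt;br /&gt;
```python
PAGE_SIZE = 4096          # assumed page size in bytes
ENTRIES_PER_TABLE = 1024  # assumed number of entries per table at each level

# Hypothetical two-level table: the first level maps to second-level tables,
# and each second-level table maps to a frame number.
second_level = {3: 42}
first_level = {1: second_level}

def walk(virtual_address):
    page, offset = divmod(virtual_address, PAGE_SIZE)
    top, low = divmod(page, ENTRIES_PER_TABLE)    # index used at each level
    frame = first_level[top][low]                 # two table accesses
    return frame * PAGE_SIZE + offset

va = (1 * ENTRIES_PER_TABLE + 3) * PAGE_SIZE + 8
print(walk(va) == 42 * PAGE_SIZE + 8)  # True
```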
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch the block. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be cached simultaneously in multiple TLBs. (This happens because of data sharing across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to become invalid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection-level changes for a TLB mapping must be observed by all processors. Likewise, a mapping modification for a TLB entry (where the physical mapping changes for a virtual address) must be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for every datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
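&lt;br /&gt;
The lazy update in this example can be sketched with a small model. The dictionary layout of the page table and the Core class are hypothetical, and the fault handler is reduced to a single re-read of the page table.&lt;br /&gt;
&lt;br /&gt;
```python
# Shared page table (hypothetical layout): page number to an entry dictionary.
page_table = {7: {"frame": 20, "writable": False}}

class Core:
    def __init__(self):
        self.tlb = {}   # private TLB: page number to a cached copy of the entry

    def load_translation(self, page):
        self.tlb[page] = dict(page_table[page])

    def store(self, page):
        """Return how the store completed: 'ok' or 'ok after refill'."""
        if page not in self.tlb:
            self.load_translation(page)
        if self.tlb[page]["writable"]:
            return "ok"
        # Access violation: the fault handler re-reads the page table.
        self.load_translation(page)
        if self.tlb[page]["writable"]:
            return "ok after refill"
        raise PermissionError("page is read-only")

core0 = Core()
core0.load_translation(7)            # caches the read-only translation
page_table[7]["writable"] = True     # another core raises the privileges
print(core0.store(7))                # ok after refill
print(core0.store(7))                # ok
```

No message to core0 was needed when the privileges were raised, which is why this kind of change is classed as safe.&lt;br /&gt;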
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it is not discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. Its variants differ in two main respects: how many processors participate in the TLB update (the victim list), and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
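&lt;br /&gt;
The six steps above can be sketched as a sequential simulation. This is a simplified model under stated assumptions: the Processor class and the shootdown function are hypothetical, interrupt masking and real locking are only recorded in a log, and the victims simply resume with their stale entries removed and refill on their next miss.&lt;br /&gt;
&lt;br /&gt;
```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.tlb = {}            # page number to frame number
        self.tlb_active = True   # flag: is this processor using its TLB?

    def handle_ipi(self, pages):
        """Step 4: invalidate the affected entries and stop using the TLB."""
        for page in pages:
            self.tlb.pop(page, None)
        self.tlb_active = False

def shootdown(initiator, victims, page_table, page, new_frame, log):
    initiator.tlb_active = False                 # step 1 (interrupt masking not modelled)
    log.append("lock page table")                # step 2
    for victim in victims:                       # step 3: send IPIs, wait for all
        victim.handle_ipi([page])
        log.append("ack " + victim.name)
    page_table[page] = new_frame                 # step 5: update PMT and own TLB
    initiator.tlb[page] = new_frame
    initiator.tlb_active = True
    log.append("unlock page table")
    for victim in victims:                       # step 6: victims resume
        victim.tlb_active = True

page_table = {4: 10}
cpus = [Processor("cpu0"), Processor("cpu1"), Processor("cpu2")]
for cpu in cpus:
    cpu.tlb[4] = 10
log = []
shootdown(cpus[0], cpus[1:], page_table, 4, 11, log)
print(log)
print([cpu.tlb.get(4) for cpu in cpus])  # [11, None, None]
```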
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns the process a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so the stale entries can no longer be used by the process. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used on AMD64 (AMD-V) processors. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
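&lt;br /&gt;
The ASID mechanism can be sketched as follows. The Cpu class, the per-processor asid_of array and the never-reused ASID counter are simplifying assumptions made for illustration; real hardware has a small, finite ASID space that must be recycled.&lt;br /&gt;
&lt;br /&gt;
```python
import itertools

class Cpu:
    def __init__(self):
        self.asid_of = {}                     # pid to ASID on this processor (0 = stale)
        self.tlb = {}                         # (asid, page) to frame
        self._next_asid = itertools.count(1)  # fresh ASIDs, never reused (simplification)

    def current_asid(self, pid):
        """Assign a new ASID if the process has none or its ASID was zeroed."""
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = next(self._next_asid)
        return self.asid_of[pid]

    def lookup(self, pid, page):
        return self.tlb.get((self.current_asid(pid), page))  # None means TLB miss

def modify_pte(cpus, initiator, pid):
    """Zero the process's ASID on every other processor; TLB entries are untouched."""
    for cpu in cpus:
        if cpu is not initiator:
            cpu.asid_of[pid] = 0

cpu0, cpu1 = Cpu(), Cpu()
cpu1.tlb[(cpu1.current_asid(100), 8)] = 30   # cpu1 caches page 8 for pid 100
print(cpu1.lookup(100, 8))                   # 30
modify_pte([cpu0, cpu1], cpu0, 100)
print(cpu1.lookup(100, 8))                   # None
```

The stale entry is still physically present in cpu1's TLB after the change, but it is tagged with the old ASID and therefore never matches again.&lt;br /&gt;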
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
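&lt;br /&gt;
A minimal sketch of the validation-based approach is given below. The Memory and Core classes are hypothetical, and the model assumes that remapping a page advances the generation count of both the old and the new frame.&lt;br /&gt;
&lt;br /&gt;
```python
class Memory:
    def __init__(self):
        self.frame_of = {}      # page number to frame number
        self.generation = {}    # frame number to generation count

    def bump(self, frame):
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def map_page(self, page, frame):
        old = self.frame_of.get(page)
        if old is not None:
            self.bump(old)      # the old frame no longer backs this page
        self.frame_of[page] = frame
        self.bump(frame)

    def access(self, frame, generation):
        """Serve the request only if the caller's generation count is current."""
        return self.generation[frame] == generation

class Core:
    def __init__(self, memory):
        self.memory = memory
        self.tlb = {}           # page to (frame, generation at fill time)

    def fill(self, page):
        frame = self.memory.frame_of[page]
        self.tlb[page] = (frame, self.memory.generation[frame])

    def read(self, page):
        if page not in self.tlb:
            self.fill(page)
        frame, generation = self.tlb[page]
        if not self.memory.access(frame, generation):
            self.fill(page)     # request denied: refresh the stale entry
            frame, generation = self.tlb[page]
        return frame

memory = Memory()
memory.map_page(1, 50)
core = Core(memory)
print(core.read(1))      # 50
memory.map_page(1, 60)   # remap page 1; no interrupt is sent to the core
print(core.read(1))      # 60
```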
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section we discuss the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy. It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach to TLB coherence and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for a large number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the processors receiving it. The effective latency is higher than shown in the graphs, because TLB invalidations result in extra cycles spent repopulating the TLB, which can happen from either the cache or main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often. (Refer to graph-2.)&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence Implementation'''&lt;br /&gt;
&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual-to-physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are therefore only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses that depend on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png|frame|none|alt=alt text|Shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.]]&lt;br /&gt;
&lt;br /&gt;
In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence of a translation to the memory block containing the PTE, rather than the PTE itself. Maintaining translation coherence at a coarser granularity (i.e., cache block rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, throughout the rest of this section we refer to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png|frame|none|alt=alt text|Figure 2 - Shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added to the PCAM simultaneously with the insertion of their corresponding translations in the TLB. Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 2a). Note that there can be multiple PCAM entries with the same physical address, as in Figure 2a; this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.]]&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed when the corresponding translation is replaced in the TLB or when a coherence request for read-write access arrives. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address identifies all associated TLB entries. Figure 2b illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
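&lt;br /&gt;
The insertion and invalidation operations can be sketched with a toy model. This is not the authors' implementation: the UnitdTlb class, the 64-byte block size and the slot-indexed lists are assumptions, and the real PCAM is a content-addressable memory searched in parallel rather than a Python list.&lt;br /&gt;
&lt;br /&gt;
```python
BLOCK_SIZE = 64   # assumed cache-block size in bytes

class UnitdTlb:
    def __init__(self, size):
        self.tlb = [None] * size      # slot holds (virtual page, frame)
        self.valid = [False] * size   # Valid bit: True = Shared, False = Invalid
        self.pcam = [None] * size     # slot holds the block address of the PTE

    def insert(self, slot, vpage, frame, pte_address):
        """Fill a TLB slot and the PCAM slot at the same index."""
        self.tlb[slot] = (vpage, frame)
        self.valid[slot] = True
        self.pcam[slot] = pte_address // BLOCK_SIZE

    def coherence_invalidate(self, address):
        """Clear the Valid bit of every entry whose PTE lies in the written block."""
        block = address // BLOCK_SIZE
        hits = [i for i, b in enumerate(self.pcam) if b == block and self.valid[i]]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits

    def lookup(self, vpage):
        for i, entry in enumerate(self.tlb):
            if self.valid[i] and entry[0] == vpage:
                return entry[1]
        return None                   # miss: reacquire from the memory system

t = UnitdTlb(4)
t.insert(0, vpage=1, frame=70, pte_address=768)   # block 12
t.insert(1, vpage=2, frame=71, pte_address=776)   # block 12 as well
t.insert(2, vpage=9, frame=72, pte_address=1024)  # block 16
print(t.coherence_invalidate(770))                # [0, 1]
print(t.lookup(1), t.lookup(9))                   # None 72
```

One invalidation of block 12 clears both translations whose PTEs share that block, which is the coarser granularity discussed above.&lt;br /&gt;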
&lt;br /&gt;
'''Performance'''&lt;br /&gt;
&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
Three observations follow. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graphs below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure1.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png]]&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture: A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59932</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59932"/>
		<updated>2012-03-19T04:00:48Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Unified Cache and TLB coherence solution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space), that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For a process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
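&lt;br /&gt;
The arithmetic in the parenthetical above can be checked in a few lines of Python. This is only an illustrative sketch; the names entry_bits and frame_size are chosen here and do not come from any particular system.&lt;br /&gt;
&lt;br /&gt;
```python
# Illustrative check of the page-table arithmetic (names are hypothetical).
entry_bits = 32                 # a 4-byte page-table entry
frame_size = 4 * 1024           # 4 KB frames

frames = 2 ** entry_bits        # number of frames one entry can name
addressable = frames * frame_size

print(addressable == 2 ** 44)   # True
print(addressable // 2 ** 40)   # size in TB: 16
```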
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. If we want to access a location, we must first index into the page table (which requires a memory access using the PTBR), then read the frame number from the page table, and only then access the required location in that frame. Each access therefore needs two memory accesses, so memory access is slowed down by a factor of 2.&lt;br /&gt;
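&lt;br /&gt;
The split of an address into a page number and an offset can be sketched as follows. This is a minimal model under stated assumptions: a single-level page table, 4 KB pages, and a made-up table named page_table; none of these values come from the sources cited on this page.&lt;br /&gt;
&lt;br /&gt;
```python
# Minimal sketch of single-level paging; all names and values are hypothetical.
PAGE_SIZE = 4096  # 4 KB pages, so the offset occupies the low 12 bits

# Toy page table: key = virtual page number, value = physical frame number.
page_table = {0: 5, 1: 9, 2: 3}

def translate(virtual_address):
    """Split the address into (p, d) and combine the frame base with d."""
    p, d = divmod(virtual_address, PAGE_SIZE)   # page number, page offset
    frame = page_table[p]                       # the extra memory access
    return frame * PAGE_SIZE + d

print(translate(1 * PAGE_SIZE + 42))  # 9 * 4096 + 42 = 36906
```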
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
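&lt;br /&gt;
The hit, miss and replacement behaviour described above can be sketched with a small model. It assumes LRU replacement and a two-entry TLB purely for illustration; the class name TLB and its fields are invented for this example and do not describe any real hardware interface.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict

class TLB:
    """Toy fully associative TLB with LRU replacement (hypothetical model)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table      # page number to frame number
        self.entries = OrderedDict()      # least recently used entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)      # mark as most recently used
            return self.entries[page]
        self.misses += 1
        frame = self.page_table[page]           # TLB miss: read the page table
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)    # evict the least recently used
        self.entries[page] = frame
        return frame

tlb = TLB(capacity=2, page_table={0: 5, 1: 9, 2: 3})
for page in [0, 1, 0, 2, 1]:
    tlb.lookup(page)
print(tlb.hits, tlb.misses)   # 1 4
```
&lt;br /&gt;
Only the second reference to page 0 hits; page 1 is evicted when page 2 arrives, so the final reference to page 1 misses again.&lt;br /&gt;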
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the page is read-only or read-write (the '''protection level'''). Another bit, the '''valid-invalid''' bit, indicates whether the page is part of that particular process's address space.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; scheme that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the TLB look-up of the address is used to fetch the block. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), then the mapping needs to be flushed from all TLBs and should not be used. Further, protection-level changes for a TLB mapping need to be observed by all processors. Mapping modifications for a TLB entry (where the physical mapping changes for a virtual address) also need to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not follow TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
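&lt;br /&gt;
The lazy update in the read-only to read-write example can be sketched as follows. This is a toy model, assuming dictionaries named page_table and tlb that stand in for the real structures; the store function and its return value are invented for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Toy model of a lazy TLB update after a privilege increase (names hypothetical).
page_table = {7: {'frame': 3, 'writable': False}}
tlb = {7: dict(page_table[7])}          # this core cached the read-only entry

page_table[7]['writable'] = True        # another core upgrades to read-write

def store(page):
    """Attempt a store; on a stale read-only entry, reload and retry once."""
    reloaded = False
    if not tlb[page]['writable']:
        # Access violation: the fault handler reloads the translation.
        tlb[page] = dict(page_table[page])
        reloaded = True
        if not tlb[page]['writable']:
            raise PermissionError('page really is read-only')
    return reloaded

print(store(7))   # True: the first store faults and refreshes the TLB entry
print(store(7))   # False: the entry is now up to date
```
&lt;br /&gt;
No message is sent to the first core when the page table changes, which is why a privilege increase counts as a safe change.&lt;br /&gt;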
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review  UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB and cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants differ in several factors: how many processors participate in the TLB update (the victim list), and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
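&lt;br /&gt;
The six steps above can be walked through with a sequential toy model. It is a sketch under strong assumptions: interrupts, locks and waiting are only mimicked by flags and log messages, victims refill their TLBs lazily rather than in step 6, and the names CPU, shootdown and log are invented for this example.&lt;br /&gt;
&lt;br /&gt;
```python
# Sequential toy model of the shootdown steps; real systems use interrupts
# and locks, which are only mimicked here. All names are hypothetical.
class CPU:
    def __init__(self, name):
        self.name = name
        self.tlb = {}            # virtual page to frame
        self.tlb_active = True

def shootdown(initiator, victims, page_table, page, new_frame, log):
    initiator.tlb_active = False                 # step 1: stop using own TLB
    log.append('lock page table')                # step 2
    for cpu in victims:                          # steps 3 and 4: IPI handling
        cpu.tlb.pop(page, None)                  # invalidate the stale entry
        cpu.tlb_active = False                   # victim waits
        log.append('ipi ' + cpu.name)
    page_table[page] = new_frame                 # step 5: update PMT and TLB
    initiator.tlb[page] = new_frame
    initiator.tlb_active = True
    log.append('unlock page table')
    for cpu in victims:                          # step 6: victims resume
        cpu.tlb_active = True

page_table = {4: 10}
cpus = [CPU('cpu0'), CPU('cpu1'), CPU('cpu2')]
for cpu in cpus:
    cpu.tlb[4] = 10
log = []
shootdown(cpus[0], cpus[1:], page_table, page=4, new_frame=22, log=log)
print(cpus[0].tlb, cpus[1].tlb, page_table)
```
&lt;br /&gt;
After the call, no victim holds the old translation for page 4, so the next access on a victim misses in its TLB and picks up the new frame from the page table.&lt;br /&gt;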
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’s ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so the stale entries are never used and the translations are reloaded on subsequent TLB misses. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used by the Xen hypervisor on AMD64 (AMD-V) processors. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
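&lt;br /&gt;
The ASID bookkeeping described above can be sketched with a small model. It assumes an unbounded ASID counter (real hardware has a small ASID field that wraps and then needs a flush), and the names Processor, asid_for and pte_modified are invented for this example.&lt;br /&gt;
&lt;br /&gt;
```python
# Toy model of ASID-based lazy invalidation; all names are hypothetical.
class Processor:
    def __init__(self):
        self.asid_of = {}        # pid to ASID on this processor (0 = stale)
        self.next_asid = 1
        self.tlb = {}            # (asid, virtual page) to frame

    def asid_for(self, pid):
        """Return the ASID for pid, assigning a fresh one if it is 0 or absent."""
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = self.next_asid
            self.next_asid += 1
        return self.asid_of[pid]

def pte_modified(pid, initiator, processors):
    """Zero the ASID of pid on every other processor; TLBs are not touched."""
    for cpu in processors:
        if cpu is not initiator:
            cpu.asid_of[pid] = 0

cpu0, cpu1 = Processor(), Processor()
old = cpu1.asid_for(pid=42)
cpu1.tlb[(old, 8)] = 100                 # cached translation for page 8
pte_modified(42, initiator=cpu0, processors=[cpu0, cpu1])
new = cpu1.asid_for(pid=42)
print(old, new, (new, 8) in cpu1.tlb)    # 1 2 False
```
&lt;br /&gt;
The stale entry tagged with the old ASID is still physically present in the TLB, but no lookup made under the new ASID can match it.&lt;br /&gt;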
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
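&lt;br /&gt;
The generation-count check can be sketched as follows. This is a minimal model, assuming the count is kept per frame and compared on every access; the class Memory and its methods are invented for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Toy model of validation by generation count; all names are hypothetical.
class Memory:
    def __init__(self):
        self.generation = {}     # frame to current generation count
        self.page_table = {}     # virtual page to frame

    def map(self, page, frame):
        """Install or change a mapping and bump the generation of the frame."""
        self.page_table[page] = frame
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def access(self, frame, generation):
        """Accept the request only if the presented generation is current."""
        return self.generation[frame] == generation

mem = Memory()
mem.map(page=1, frame=7)
tlb_entry = (7, mem.generation[7])       # (frame, generation) cached in a TLB
print(mem.access(*tlb_entry))            # True
mem.map(page=2, frame=7)                 # frame 7 is reassigned to page 2
print(mem.access(*tlb_entry))            # False: the stale entry is rejected
```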
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section the salient features of UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu/Lebeck/Sorin/Bracy in their paper are discussed. This provides an example of hardware based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without needing any change to existing cache coherence protocol. This protocol is an improvement over the software based shootdown approach for TLB coherence and reduces the performance penalty due to TLB coherence significantly.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. A hardware-based TLB coherence scheme can a) improve TLB coherence performance significantly for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the ones receiving it. The effective latency is higher than shown in the graphs because TLB invalidations result in extra cycles spent repopulating the TLB, from either the cache or main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often. (Refer to graph-2.)&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence Implementation'''&lt;br /&gt;
&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual to physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure5.png]]&lt;br /&gt;
&lt;br /&gt;
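Coherence requests carry physical addresses, whereas the TLB is looked up by virtual page number. UNITD therefore pairs each TLB with a PTE CAM (PCAM), a content-addressable memory that holds the physical addresses of the PTEs whose translations are cached in the TLB. Figure-3 shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.&lt;br /&gt;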
&lt;br /&gt;
In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence so that each translation is tied to the memory block containing its PTE rather than to the PTE itself. Maintaining translation coherence at a coarser grain (i.e., cache block rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same datum. For simplicity, the rest of this section refers to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png]]&lt;br /&gt;
&lt;br /&gt;
Figure-4 shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added to the PCAM simultaneously with the insertion of their corresponding translations in the TLB. Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 4a). Note that there can be multiple PCAM entries with the same physical address, as in Figure 4a; this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 4b illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
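&lt;br /&gt;
The two PCAM operations can be sketched with a small model. It is an illustration only, assuming a four-entry TLB and explicit insertion indices; the class UnitdTLB and its method names are invented here and are not taken from the UNITD paper.&lt;br /&gt;
&lt;br /&gt;
```python
# Toy model of a TLB with a parallel PCAM; names and sizes are hypothetical.
class UnitdTLB:
    def __init__(self, size):
        self.valid = [False] * size
        self.translation = [None] * size   # (virtual page, frame)
        self.pcam = [None] * size          # block address of the PTE

    def insert(self, index, vpage, frame, pte_block):
        """Insert a translation and its PTE address at the same index."""
        self.valid[index] = True
        self.translation[index] = (vpage, frame)
        self.pcam[index] = pte_block

    def coherence_invalidate(self, block):
        """Clear the Valid bit of every entry whose PCAM tag matches."""
        hits = [i for i, tag in enumerate(self.pcam)
                if tag == block and self.valid[i]]
        for i in hits:
            self.valid[i] = False
        return hits

tlb = UnitdTLB(size=4)
tlb.insert(0, vpage=1, frame=30, pte_block=12)
tlb.insert(1, vpage=2, frame=31, pte_block=12)   # PTE in the same cache block
tlb.insert(2, vpage=9, frame=40, pte_block=47)
print(tlb.coherence_invalidate(12))   # [0, 1]
print(tlb.valid)                      # [False, False, True, False]
```
&lt;br /&gt;
As in Figure 4b, one invalidation of block address 12 hits two PCAM entries and clears both corresponding Valid bits.&lt;br /&gt;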
&lt;br /&gt;
'''Performance'''&lt;br /&gt;
&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. (The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.)&lt;br /&gt;
&lt;br /&gt;
Three observations follow. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graphs below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure1.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png]]&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture: A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59927</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59927"/>
		<updated>2012-03-19T03:56:44Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Unified Cache and TLB coherence solution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space), that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of physical memory. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), a system with 4-byte entries can address 2^32 × 2^12 = 2^44 bytes (16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the PTBR), find the frame number in the page-table entry, and only then access the required location in that frame. Each access to a user location therefore costs two memory accesses, so memory access is slowed down by a factor of 2.&lt;br /&gt;
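&lt;br /&gt;
The translation described above can be sketched in a few lines of Python. This is only an illustrative model: the 4 KB page size, the dictionary used as a page table and the names PAGE_SIZE, page_table and translate are assumptions made for the example and do not describe any particular architecture.&lt;br /&gt;
&lt;br /&gt;
```python
# Illustrative model only: the 4 KB page size and dict-based page table are assumptions.
PAGE_SIZE = 4096  # bytes per page and per frame (2**12)

# Hypothetical page table for one process: virtual page number to physical frame number.
page_table = {0: 5, 1: 9, 2: 3}


def translate(virtual_address):
    # Split the address into page number (p) and page offset (d).
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    # On real hardware this table access is itself a memory access.
    frame_number = page_table[page_number]
    # The frame base address combined with the offset gives the physical address.
    return frame_number * PAGE_SIZE + offset


# Virtual address 4100 lies in page 1 at offset 4, so it maps to frame 9, offset 4.
physical = translate(4100)
```
&lt;br /&gt;
The table lookup inside translate stands for the extra memory access that motivates the TLB.&lt;br /&gt;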
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
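&lt;br /&gt;
As a rough sketch of the hit and miss handling just described, the following Python model keeps a small set of translations and evicts the least recently used one when it is full. The class name TLB, its capacity parameter and the dictionary used as a page table are hypothetical; real TLBs perform the comparison in hardware.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class TLB:
    # Toy fully associative TLB with LRU replacement (capacity is an assumed parameter).
    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table      # backing page table (dict: page to frame)
        self.entries = OrderedDict()      # page to frame, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)    # mark as most recently used
            return self.entries[page]
        # TLB miss: consult the page table, then cache the translation.
        self.misses += 1
        frame = self.page_table[page]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        self.entries[page] = frame
        return frame


tlb = TLB(capacity=2, page_table={0: 5, 1: 9, 2: 3})
frames = [tlb.lookup(p) for p in (0, 1, 0, 2, 1)]
```
&lt;br /&gt;
With a capacity of two entries, the reference to page 2 evicts page 1, so the final reference to page 1 misses again.&lt;br /&gt;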
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame is read-only or read-write ('''protection level'''). Another bit, '''valid-invalid''', is added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to the entry in the next-level page table and so on until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; solution that enables simultaneous look-up of both the L-1 cache and the TLB. If the block is not found in the L-1 cache, the TLB look-up of the address is used to refresh the block. Simultaneous look-up of both the L-1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), that mapping needs to be flushed from all TLBs and must not be used. Further, protection level changes for a TLB mapping need to be observed by all processors. Mapping modifications for a TLB entry (where the physical mapping changes for a virtual address) also need to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not maintain TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
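&lt;br /&gt;
The distinction between safe and unsafe changes can be summarised as a small predicate. The sketch below is an assumption-laden illustration: the entry layout (frame, writable, present) and the function name is_unsafe_change are invented for the example and are not taken from any operating system.&lt;br /&gt;
&lt;br /&gt;
```python
# Illustrative classification; the entry layout (frame, writable, present) is an assumption.
def is_unsafe_change(old, new):
    # A change is unsafe when a stale TLB copy could permit an access that the
    # updated page-table entry forbids, or could point at the wrong frame.
    if old['present'] and new['frame'] != old['frame']:
        return True    # the page-to-frame mapping was modified
    if old['present'] and not new['present']:
        return True    # the page is no longer resident
    if old['writable'] and not new['writable']:
        return True    # protection increased (read-write to read-only)
    return False       # e.g. read-only to read-write, or setting the presence bit


read_only = {'frame': 7, 'writable': False, 'present': True}
read_write = {'frame': 7, 'writable': True, 'present': True}
remapped = {'frame': 8, 'writable': False, 'present': True}
absent = {'frame': 7, 'writable': False, 'present': False}
```
&lt;br /&gt;
For instance, is_unsafe_change(read_only, read_write) is False because a stale read-only entry can only cause a fault that the kernel resolves, whereas is_unsafe_change(read_write, read_only) is True.&lt;br /&gt;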
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors, including how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
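&lt;br /&gt;
The six steps above can be walked through with a sequential toy model. It is a sketch under strong simplifications: interrupts, locks and busy-waiting are represented only by flags and log entries, victims reload the translation lazily on their next miss rather than in step 6, and the names Processor and shootdown are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Sequential toy model of the six steps; real systems use interrupts and locks.
class Processor:
    def __init__(self, name):
        self.name = name
        self.tlb = {}                    # cached translations: page to frame
        self.tlb_active = True           # the TLB active flag from step 1
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, page, new_frame, log):
    # Step 1: the initiator disables interrupts and clears its active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: lock the page table (modelled as a log entry only).
    log.append('lock')
    # Steps 3 and 4: each victim services the IPI, invalidates the entry and waits.
    for cpu in victims:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(page, None)
        cpu.tlb_active = False
        log.append('invalidated on ' + cpu.name)
    # Step 5: the initiator updates the page table and its own TLB, then unlocks.
    page_table[page] = new_frame
    initiator.tlb[page] = new_frame
    initiator.tlb_active = True
    initiator.interrupts_enabled = True
    log.append('unlock')
    # Step 6: the victims leave the wait state and resume using their TLBs.
    for cpu in victims:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True


page_table = {4: 10}
cpus = [Processor('cpu0'), Processor('cpu1'), Processor('cpu2')]
for cpu in cpus:
    cpu.tlb[4] = 10                      # every TLB caches the old translation
events = []
shootdown(cpus[0], cpus[1:], page_table, page=4, new_frame=22, log=events)
```
&lt;br /&gt;
The loop over the victims makes the cost visible: the work done before the page table can be unlocked grows with the number of processors on the victim list.&lt;br /&gt;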
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors assigns each TLB entry a unique identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, p. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, forcing a subsequent TLB invalidation. The IRIX 5.2 operating system uses this mechanism on MIPS, and ASIDs are likewise used on AMD64 hardware (for example by Xen with AMD-V). &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
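&lt;br /&gt;
A minimal sketch of this bookkeeping is given below. It assumes a simple counter per processor for handing out ASIDs and does not model wrap-around of the finite ASID space; the class AsidAllocator and its method names are hypothetical and do not correspond to IRIX or any other kernel.&lt;br /&gt;
&lt;br /&gt;
```python
# Toy model of per-processor ASID assignment; all names here are illustrative.
class AsidAllocator:
    def __init__(self, num_cpus):
        self.next_asid = [1] * num_cpus   # 0 is reserved to mean: no valid ASID
        self.asid_of = {}                 # (pid, cpu) to ASID, the separate mapping array

    def asid_for(self, pid, cpu):
        # Called when pid is scheduled on cpu: allocate an ASID if none is valid.
        if self.asid_of.get((pid, cpu), 0) == 0:
            self.asid_of[(pid, cpu)] = self.next_asid[cpu]
            self.next_asid[cpu] += 1
        return self.asid_of[(pid, cpu)]

    def pte_modified(self, pid, current_cpu):
        # Zero the ASIDs of pid on every other processor; TLB entries are untouched.
        for (p, cpu) in list(self.asid_of):
            if p == pid and cpu != current_cpu:
                self.asid_of[(p, cpu)] = 0


alloc = AsidAllocator(num_cpus=2)
old_asid = alloc.asid_for(pid=42, cpu=1)   # TLB entries on cpu 1 carry this tag
alloc.asid_for(pid=42, cpu=0)
alloc.pte_modified(pid=42, current_cpu=0)
new_asid = alloc.asid_for(pid=42, cpu=1)   # differs from old_asid, so stale entries miss
```
&lt;br /&gt;
Because new_asid differs from old_asid, translations tagged with the old value on processor 1 can no longer match, without any interrupt being sent to that processor.&lt;br /&gt;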
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
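&lt;br /&gt;
The generation-count check can be illustrated with the following sketch. The data layout (a count per frame, copied into the TLB entry at fill time) follows the description above, while the class names Memory and Tlb, the choice to increment the count on both the old and the new frame, and the refill-on-denial behaviour are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
# Toy model of validation by generation count; the data layout is an assumption.
class Memory:
    def __init__(self):
        self.page_table = {}    # page to frame
        self.generation = {}    # frame to generation count

    def map_page(self, page, frame):
        # A mapping change bumps the count on both the old and the new frame.
        old_frame = self.page_table.get(page)
        for f in (old_frame, frame):
            if f is not None:
                self.generation[f] = self.generation.get(f, 0) + 1
        self.page_table[page] = frame

    def access(self, frame, generation):
        # The request is honoured only if the count presented is current.
        return self.generation.get(frame) == generation


class Tlb:
    def __init__(self, memory):
        self.memory = memory
        self.entries = {}       # page to (frame, generation seen at fill time)
        self.refills = 0

    def fill(self, page):
        frame = self.memory.page_table[page]
        self.entries[page] = (frame, self.memory.generation[frame])

    def read(self, page):
        if page not in self.entries:
            self.fill(page)
        frame, generation = self.entries[page]
        if not self.memory.access(frame, generation):
            self.refills += 1   # stale entry detected at access time
            self.fill(page)
            frame, generation = self.entries[page]
        return frame


mem = Memory()
mem.map_page(1, 7)
tlb = Tlb(mem)
first = tlb.read(1)     # fills the TLB with frame 7
mem.map_page(1, 8)      # remap without notifying the TLB
second = tlb.read(1)    # stale count is rejected, entry is refreshed to frame 8
```
&lt;br /&gt;
No message is sent to the TLB when the mapping changes; the stale entry is caught only when it is next used.&lt;br /&gt;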
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section the salient features of UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu/Lebeck/Sorin/Bracy in their paper are discussed. This provides an example of hardware based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach for TLB coherence and significantly reduces the performance penalty due to TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are all-hardware based because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) improve TLB coherence performance significantly for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the ones receiving it. The effective latency is higher than shown in the graphs because TLB invalidations result in extra cycles spent repopulating the TLB, from either the cache or main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often. (Refer to graph-2.)&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence Implementation'''&lt;br /&gt;
&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
TLB entries are read-only. (Virtual to physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are therefore only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure5.png]]&lt;br /&gt;
&lt;br /&gt;
Figure-3 shows how the PCAM (a content-addressable memory that holds the physical addresses of the PTEs whose translations are cached in the TLB) is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.&lt;br /&gt;
&lt;br /&gt;
In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence of the translations to the memory block containing the PTE, rather than the PTE itself. Maintaining translation coherence at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same datum. For simplicity, the rest of this section refers to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png]]&lt;br /&gt;
&lt;br /&gt;
Figure-4 shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB. Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 4a). Note that there can be multiple PCAM entries with the same physical address, as in Figure 4a; this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 4b illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
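&lt;br /&gt;
The two PCAM operations can be modelled roughly as follows. This is a software illustration of a hardware structure, so the class UnitdTlb, its list-based storage and the example addresses are assumptions rather than the design from the paper; only the behaviour (same-index insertion, multiple hits on one block address, clearing of Valid bits) follows the description above.&lt;br /&gt;
&lt;br /&gt;
```python
# Toy model of a TLB paired with a PCAM; sizes and addresses are made up.
class UnitdTlb:
    def __init__(self, size):
        self.translations = [None] * size   # (virtual page, frame) per TLB slot
        self.valid = [False] * size         # Valid bit: True is Shared, False is Invalid
        self.pcam = [None] * size           # PTE block address, same index as the TLB slot

    def insert(self, index, vpage, frame, pte_block_addr):
        # The translation and its PTE address are written at the same index.
        self.translations[index] = (vpage, frame)
        self.valid[index] = True
        self.pcam[index] = pte_block_addr

    def coherence_invalidate(self, block_addr):
        # Associative search: every PCAM entry matching the address is a hit.
        hits = [i for i, addr in enumerate(self.pcam) if addr == block_addr]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits

    def lookup(self, vpage):
        # Regular TLB lookups never touch the PCAM.
        for i, entry in enumerate(self.translations):
            if self.valid[i] and entry[0] == vpage:
                return entry[1]
        return None                         # miss: translation must be reacquired


tlb = UnitdTlb(size=4)
tlb.insert(0, vpage=1, frame=30, pte_block_addr=12)
tlb.insert(1, vpage=2, frame=31, pte_block_addr=12)   # PTE in the same cache block
tlb.insert(2, vpage=9, frame=40, pte_block_addr=20)
hit_slots = tlb.coherence_invalidate(12)
```
&lt;br /&gt;
A single invalidation of block address 12 clears both translations whose PTEs share that cache block, which is the coarser granularity that UNITD accepts in exchange for ease of integration.&lt;br /&gt;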
&lt;br /&gt;
'''Performance'''&lt;br /&gt;
&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. (The ideal-invalidation system uses the same modified OS as UNITD, i.e., with no TLB shootdown support, and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.)&lt;br /&gt;
&lt;br /&gt;
UNITD is efficient in ensuring translation coherence. First, it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graphs below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure1.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png]]&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction/Translation/Data (UNITD) Coherence: One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture: A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59926</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59926"/>
		<updated>2012-03-19T03:55:45Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Unified Cache and TLB coherence solution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, its &amp;quot;'''Virtual Memory'''&amp;quot; (one single address space), that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For a process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (If each page-table entry is 4 bytes (32 bits) long, an entry can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), a system with 4-byte entries can address 2^44 bytes, or 16 TB, of physical memory.)&lt;br /&gt;
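&lt;br /&gt;
The arithmetic in the parenthetical above can be checked with a short Python sketch. The variable names are illustrative only; the 32-bit entry and 4 KB frame size are the values assumed in the text.&lt;br /&gt;
&lt;br /&gt;
```python
# Assumed sizes from the text: 32-bit page-table entries and 4 KB frames.
ENTRY_BITS = 32
FRAME_SIZE_BYTES = 4 * 1024               # 4 KB = 2**12 bytes

frames = 2 ** ENTRY_BITS                  # frames that one entry can name
addressable_bytes = frames * FRAME_SIZE_BYTES
terabytes = addressable_bytes // 2 ** 40  # 1 TB = 2**40 bytes

print(addressable_bytes == 2 ** 44)       # True
print(terabytes)                          # 16
```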
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the value in the PTBR), then read the frame number from the page-table entry, and only then access the desired location in that frame. Every access therefore needs two memory accesses, so memory access is slowed down by a factor of 2.&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
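&lt;br /&gt;
The hit, miss and replacement behaviour described in the two paragraphs above can be sketched as a small Python model. The class name TLB, its fields and the example page table are assumptions made for this illustration; LRU is just one of the replacement policies the text mentions.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class TLB:
    """Tiny fully associative TLB with LRU replacement (illustrative only)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # dict: page number to frame number
        self.entries = OrderedDict()   # least recently used entry first
        self.hits = 0
        self.misses = 0

    def translate(self, page):
        if page in self.entries:
            # TLB hit: the frame number is available immediately.
            self.hits += 1
            self.entries.move_to_end(page)
            return self.entries[page]
        # TLB miss: consult the page table (an extra memory reference).
        self.misses += 1
        frame = self.page_table[page]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)   # evict the LRU entry
        self.entries[page] = frame
        return frame


page_table = {0: 7, 1: 3, 2: 9}
tlb = TLB(capacity=2, page_table=page_table)
frames = [tlb.translate(p) for p in (0, 1, 0, 2, 1)]
print(frames, tlb.hits, tlb.misses)   # [7, 3, 7, 9, 3] 1 4
```
&lt;br /&gt;
With only two entries, the reference to page 2 evicts page 1 (the least recently used), so the final reference to page 1 misses again.&lt;br /&gt;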
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; design that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch it. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection-level changes for a TLB mapping must be observed by all processors. Finally, a mapping modification for a TLB entry (where the physical mapping changes for a virtual address) must be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers (ASIDs), and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
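&lt;br /&gt;
The six steps above can be traced with a deliberately simplified, single-threaded Python sketch. Everything in it is an assumption made for illustration: the Processor and PageTable classes, the boolean that stands in for the page-table lock, and the direct method calls that stand in for inter-processor interrupts. The text does not say when interrupts are re-enabled, so the points chosen here are also assumptions, and recipients are left to refill their TLBs on their next miss.&lt;br /&gt;
&lt;br /&gt;
```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.tlb = {}                    # virtual page to frame
        self.tlb_active = True           # is this processor using its TLB?
        self.interrupts_enabled = True


class PageTable:
    def __init__(self, entries):
        self.entries = dict(entries)     # virtual page to frame
        self.locked = False              # a boolean stands in for a real lock


def shootdown(initiator, recipients, page_table, vpage, new_frame):
    # Step 1: the initiator disables local interrupts and clears its flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: the initiator locks the page table.
    page_table.locked = True
    # Steps 3 and 4: the IPI is modelled as a direct call on each recipient,
    # which invalidates the stale entry, clears its flag and then waits.
    for cpu in recipients:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(vpage, None)
        cpu.tlb_active = False
    # Step 5: the initiator updates the page table and its own TLB,
    # sets its active flag, then releases the lock.
    page_table.entries[vpage] = new_frame
    initiator.tlb[vpage] = new_frame
    initiator.tlb_active = True
    page_table.locked = False
    initiator.interrupts_enabled = True  # assumption: not stated in the text
    # Step 6: recipients leave their wait state and resume using their TLBs.
    for cpu in recipients:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True    # assumption: not stated in the text


page_table = PageTable({4: 100})
a, b, c = Processor("A"), Processor("B"), Processor("C")
for cpu in (a, b, c):
    cpu.tlb[4] = 100                     # every TLB caches the old mapping
shootdown(a, [b, c], page_table, vpage=4, new_frame=250)
print(page_table.entries[4], a.tlb.get(4), b.tlb.get(4), c.tlb.get(4))
# 250 250 None None
```
&lt;br /&gt;
After the shootdown no processor holds the stale frame 100: the initiator has the new mapping and the recipients will miss and reload it from the page table.&lt;br /&gt;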
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors assigns each TLB entry a unique identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so the stale entries can no longer be used and the translations are reloaded. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used on AMD64 processors. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
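&lt;br /&gt;
The ASID mechanism can be illustrated with a small Python model. This is a sketch under stated assumptions, not the IRIX implementation: the Kernel and Cpu classes and their method names are hypothetical, TLB entries are keyed by an (ASID, virtual page) pair, and ASID exhaustion and wraparound are ignored.&lt;br /&gt;
&lt;br /&gt;
```python
class Cpu:
    def __init__(self):
        self.tlb = {}        # (asid, virtual page) to frame
        self.next_asid = 1   # 0 is reserved to mean "needs a new ASID"


class Kernel:
    def __init__(self, ncpus):
        self.cpus = [Cpu() for _ in range(ncpus)]
        self.asid_of = {}    # (pid, cpu index) to ASID: the separate array

    def asid_for(self, pid, cpu_id):
        asid = self.asid_of.get((pid, cpu_id), 0)
        if asid == 0:
            # Zero ASID found: hand out a fresh one on this processor.
            cpu = self.cpus[cpu_id]
            asid = cpu.next_asid
            cpu.next_asid += 1
            self.asid_of[(pid, cpu_id)] = asid
        return asid

    def access(self, pid, cpu_id, vpage, page_table):
        key = (self.asid_for(pid, cpu_id), vpage)
        tlb = self.cpus[cpu_id].tlb
        if key not in tlb:
            tlb[key] = page_table[vpage]   # TLB miss: refill from page table
        return tlb[key]

    def modify_pte(self, pid, cpu_id, vpage, new_frame, page_table):
        page_table[vpage] = new_frame
        # The modifying processor fixes up its own TLB entry directly.
        asid = self.asid_for(pid, cpu_id)
        self.cpus[cpu_id].tlb[(asid, vpage)] = new_frame
        # On every other processor only the mapping array is zeroed;
        # the TLB entries themselves are left untouched.
        for other in range(len(self.cpus)):
            if other != cpu_id and (pid, other) in self.asid_of:
                self.asid_of[(pid, other)] = 0


page_table = {3: 50}
kernel = Kernel(ncpus=2)
first = kernel.access(pid=1, cpu_id=0, vpage=3, page_table=page_table)
second = kernel.access(pid=1, cpu_id=1, vpage=3, page_table=page_table)
kernel.modify_pte(pid=1, cpu_id=0, vpage=3, new_frame=60, page_table=page_table)
third = kernel.access(pid=1, cpu_id=1, vpage=3, page_table=page_table)
print(first, second, third)   # 50 50 60
```
&lt;br /&gt;
The stale entry tagged with the old ASID is still physically present in the second processor's TLB, but because the process now runs under a new ASID it can never match again.&lt;br /&gt;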
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
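&lt;br /&gt;
A minimal Python sketch of the generation-count check follows. The Memory and Cpu classes, their method names and the refill counter are assumptions introduced for this illustration; in particular, keeping the count next to each page-table entry and retrying immediately after a denied request are simplifications.&lt;br /&gt;
&lt;br /&gt;
```python
class Memory:
    def __init__(self):
        self.page_table = {}   # virtual page to (frame, generation count)

    def map(self, vpage, frame):
        # Changing a mapping bumps the generation count.
        old = self.page_table.get(vpage)
        generation = 0 if old is None else old[1] + 1
        self.page_table[vpage] = (frame, generation)

    def validate(self, vpage, frame, generation):
        # The check happens at access time; no interrupts are involved.
        return self.page_table[vpage] == (frame, generation)


class Cpu:
    def __init__(self, memory):
        self.memory = memory
        self.tlb = {}          # virtual page to (frame, generation count)
        self.refills = 0

    def refill(self, vpage):
        self.tlb[vpage] = self.memory.page_table[vpage]
        self.refills += 1

    def load(self, vpage):
        if vpage not in self.tlb:
            self.refill(vpage)
        frame, generation = self.tlb[vpage]
        if not self.memory.validate(vpage, frame, generation):
            # Request denied: the entry is stale, so update it and retry.
            self.refill(vpage)
            frame, generation = self.tlb[vpage]
        return frame


memory = Memory()
memory.map(2, 10)
cpu = Cpu(memory)
before = cpu.load(2)    # first use: TLB miss and refill
memory.map(2, 30)       # remapping changes the generation count
after = cpu.load(2)     # stale entry is rejected and refreshed
print(before, after, cpu.refills)   # 10 30 2
```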
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section the salient features of UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu/Lebeck/Sorin/Bracy in their paper are discussed. This provides an example of hardware based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without needing any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach for TLB coherence and significantly reduces the performance penalty due to TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are all-hardware based because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. A hardware-based TLB coherence protocol can a) improve TLB coherence performance significantly for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the ones receiving it. The effective latency is higher than shown in the graphs because TLB invalidations cost extra cycles to repopulate the TLB, either from the cache or from main memory depending on the case. The latency from TLB shootdowns is also higher for applications that use the routine more often. (Refer to graph-2.)&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence Implementation'''&lt;br /&gt;
&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual-to-physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are therefore only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, and subsequent memory accesses depending on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure5.png]]&lt;br /&gt;
&lt;br /&gt;
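Coherence messages carry physical addresses, whereas TLB entries are tagged by virtual page. UNITD therefore pairs each TLB with a PCAM (PTE CAM), a content-addressable memory that holds the physical addresses of the PTEs whose translations are cached in the TLB. Figure 3 shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.&lt;br /&gt;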
&lt;br /&gt;
In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence so that each cached translation is tied to the memory block containing the PTE rather than to the PTE itself. Maintaining translation coherence at this coarser granularity (cache block rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can reside in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, the rest of this section refers to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.png]]&lt;br /&gt;
&lt;br /&gt;
Figure-4 shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added to the PCAM simultaneously with the insertion of their corresponding translations in the TLB. Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 4a). Note that there can be multiple PCAM entries with the same physical address, as in Figure 4a; this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 4b illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
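&lt;br /&gt;
The PCAM insertion and invalidation operations described above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design from the UNITD paper: the class name UnitdTlb, its method names and the example addresses are assumptions made for this sketch.&lt;br /&gt;

```python
class UnitdTlb:
    """Toy model of a TLB paired with a PCAM of the same size (hypothetical names)."""

    def __init__(self, size):
        self.valid = [False] * size
        self.translation = [None] * size  # (virtual page, physical frame)
        self.pcam = [None] * size         # address of the block holding the PTE

    def insert(self, index, virtual_page, frame, pte_block_address):
        # The PTE block address goes into the PCAM at the same index
        # as the translation goes into the TLB.
        self.translation[index] = (virtual_page, frame)
        self.pcam[index] = pte_block_address
        self.valid[index] = True

    def lookup(self, virtual_page):
        # Regular TLB lookups never touch the PCAM.
        for index, entry in enumerate(self.translation):
            if self.valid[index] and entry[0] == virtual_page:
                return entry[1]
        return None  # TLB miss

    def coherence_invalidate(self, block_address):
        # A coherence request may hit several PCAM entries at once.
        hits = [i for i, address in enumerate(self.pcam) if address == block_address]
        for i in hits:
            self.valid[i] = False  # clear the Valid bit of the TLB entry
            self.pcam[i] = None    # remove the PCAM entry
        return hits


tlb = UnitdTlb(4)
tlb.insert(0, virtual_page=100, frame=5, pte_block_address=12)
tlb.insert(1, virtual_page=200, frame=6, pte_block_address=40)
tlb.insert(2, virtual_page=101, frame=8, pte_block_address=12)
print(tlb.coherence_invalidate(12))  # [0, 2]
print(tlb.lookup(100))               # None
print(tlb.lookup(200))               # 6
```

In this sketch, two translations whose PTEs share block address 12 are invalidated by a single coherence request, while the translation backed by block 40 stays usable.&lt;br /&gt;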
&lt;br /&gt;
'''Performance'''&lt;br /&gt;
&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. This validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
Three results stand out. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graphs below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure2.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.png]]&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction/Translation/Data (UNITD) Coherence: One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture: A Hardware/Software Approach, David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units, Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. The page table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence?&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59924</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59924"/>
		<updated>2012-03-19T03:54:46Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Unified Cache and TLB coherence solution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space), that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct, managed by the operating system, that is kept in main memory. For each process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (As an example of the reach of a page table, an entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), a system with 4-byte entries can address 2^32 × 2^12 = 2^44 bytes (16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the PTBR), read the frame number from the page-table entry, and only then access the desired location in that frame. Every data access therefore needs two memory accesses, and memory access is slowed down by a factor of 2.&lt;br /&gt;
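&lt;br /&gt;
The split of an address into page number and offset can be illustrated with a short Python sketch. The 4 KB page size, the dictionary used as a page table and the function name translate are assumptions chosen for this example, not part of any particular architecture.&lt;br /&gt;

```python
PAGE_SIZE = 4096  # assumed page/frame size of 4 KB


def translate(virtual_address, page_table):
    """Translate a virtual address through a single-level page table."""
    # Split the address into page number (p) and page offset (d).
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    # Reading the page table is the extra memory access discussed above.
    frame_number = page_table[page_number]
    # Combine the frame base address with the unchanged offset.
    return frame_number * PAGE_SIZE + offset


page_table = {0: 5, 1: 2, 2: 7}  # virtual page -> physical frame
print(translate(1 * PAGE_SIZE + 100, page_table))  # 8292, i.e. frame 2, offset 100
```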
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
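&lt;br /&gt;
The hit and miss paths described above can be sketched as a small software model. This is a minimal illustration that assumes an LRU replacement policy and a dictionary standing in for the page table; the class name TLB and its fields are hypothetical, and a real TLB performs the comparison in parallel hardware.&lt;br /&gt;

```python
from collections import OrderedDict


class TLB:
    """Toy TLB with LRU replacement (one of the policies mentioned above)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # page number -> frame number
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number, page_table):
        if page_number in self.entries:
            # TLB hit: the frame number is available immediately.
            self.hits += 1
            self.entries.move_to_end(page_number)  # mark as most recently used
            return self.entries[page_number]
        # TLB miss: consult the page table in memory.
        self.misses += 1
        frame_number = page_table[page_number]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        self.entries[page_number] = frame_number
        return frame_number


page_table = {0: 5, 1: 2, 2: 7}
tlb = TLB(capacity=2)
for page in [0, 1, 0, 2, 1]:
    tlb.lookup(page, page_table)
print(tlb.hits, tlb.misses)  # 1 4
```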
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the page is read-only or read-write ('''protection level'''). Another bit, '''valid-invalid''', is added to indicate whether the page is part of that particular process's address space.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the address obtained from the TLB look-up is used to fetch the block. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection-level changes for a TLB mapping must be observed by all processors. Mapping modifications for a TLB entry (where the physical mapping changes for a virtual address) also need to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not maintain TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shoot down, address space Identifiers, hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review  UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB and cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) store various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. Its variants differ in several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
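&lt;br /&gt;
The six steps above can be traced with a sequential Python sketch. It only mirrors the ordering of the steps: interrupts, locks and busy-waiting are replaced by plain assignments, and the names Processor and shootdown are hypothetical. In step 6 the sketch assumes that victims reload the new translation immediately; an implementation could equally leave the entry empty until the next TLB miss.&lt;br /&gt;

```python
class Processor:
    """A toy processor with a private TLB (page number -> frame number)."""

    def __init__(self, name):
        self.name = name
        self.tlb = {}
        self.tlb_active = True


def shootdown(initiator, victims, page_table, page_number, new_frame):
    """Apply one page-table update using the six steps, run sequentially."""
    # Steps 1-2: the initiator stops using its TLB and takes the page-table lock.
    initiator.tlb_active = False
    page_table_locked = True

    # Steps 3-4: each victim services the (simulated) IPI, drops the stale
    # entry, clears its active flag and waits for the lock to be released.
    for cpu in victims:
        cpu.tlb.pop(page_number, None)
        cpu.tlb_active = False

    # Step 5: the initiator updates the page table and its own TLB,
    # reactivates its TLB and only then releases the lock.
    page_table[page_number] = new_frame
    initiator.tlb[page_number] = new_frame
    initiator.tlb_active = True
    page_table_locked = False

    # Step 6: the victims leave the wait state and reload the translation.
    for cpu in victims:
        cpu.tlb[page_number] = page_table[page_number]
        cpu.tlb_active = True
    return page_table_locked


page_table = {3: 10}
cpus = [Processor("cpu0"), Processor("cpu1"), Processor("cpu2")]
for cpu in cpus:
    cpu.tlb[3] = 10  # every TLB caches the old mapping

shootdown(cpus[0], cpus[1:], page_table, page_number=3, new_frame=42)
print([cpu.tlb[3] for cpu in cpus])  # [42, 42, 42]
```

The sketch makes the cost structure visible: every victim is interrupted and idles for the whole duration of step 5, which is why the penalty grows with the number of processors.&lt;br /&gt;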
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors assigns each TLB entry a unique identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’s ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, forcing a subsequent TLB invalidation. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used on AMD64 processors, for example by the Xen hypervisor with AMD-V. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
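&lt;br /&gt;
A rough Python sketch of the ASID bookkeeping described above follows. The names Cpu, asid_of, current_asid and pte_modified are hypothetical, and details such as ASID exhaustion and wrap-around are deliberately ignored.&lt;br /&gt;

```python
import itertools


class Cpu:
    def __init__(self):
        self.next_asid = itertools.count(1)  # 0 is reserved to mean "stale"
        self.tlb = {}                        # (asid, page number) -> frame number


asid_of = {}  # (pid, cpu index) -> ASID; the separate per-process array


def current_asid(pid, cpu_index, cpus):
    """Return the ASID of a process on a CPU, assigning a fresh one if it is 0."""
    if asid_of.get((pid, cpu_index), 0) == 0:
        asid_of[(pid, cpu_index)] = next(cpus[cpu_index].next_asid)
    return asid_of[(pid, cpu_index)]


def pte_modified(pid, modifying_cpu, cpus):
    """Zero the ASIDs of the process on every other CPU; TLBs are not touched."""
    for index in range(len(cpus)):
        if index != modifying_cpu:
            asid_of[(pid, index)] = 0


cpus = [Cpu(), Cpu()]
old = current_asid(7, 1, cpus)
cpus[1].tlb[(old, 3)] = 9        # process 7 caches page 3 on CPU 1
pte_modified(7, 0, cpus)         # CPU 0 changes one of the process's PTEs
new = current_asid(7, 1, cpus)
print(old, new)                  # 1 2
print((new, 3) in cpus[1].tlb)   # False: the stale entry no longer matches
```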
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
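&lt;br /&gt;
The generation-count check can be sketched as follows. This is a simplified illustration: the dictionary generation and the three function names are assumptions made for this example, and in a real system the comparison would be made by the memory module rather than by software.&lt;br /&gt;

```python
generation = {}  # frame number -> current generation count kept by the memory manager


def create_tlb_entry(page, frame):
    """Snapshot the frame's generation count when the translation is cached."""
    return {"page": page, "frame": frame, "generation": generation[frame]}


def remap_or_restrict(frame):
    """Called on an unsafe change: bump the frame's generation count."""
    generation[frame] += 1


def access_allowed(entry):
    """Memory-side check: requests with a stale generation count are denied."""
    return entry["generation"] == generation[entry["frame"]]


generation[7] = 0
entry = create_tlb_entry(page=2, frame=7)
print(access_allowed(entry))  # True
remap_or_restrict(7)          # e.g. permissions restricted to read-only
print(access_allowed(entry))  # False: the processor must refresh its entry
entry = create_tlb_entry(page=2, frame=7)
print(access_allowed(entry))  # True
```

No interrupt or lock appears anywhere in the sketch; the cost is instead paid as one extra comparison on each memory request.&lt;br /&gt;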
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section the salient features of UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu/Lebeck/Sorin/Bracy in their paper are discussed. This provides an example of hardware based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach for TLB coherence and significantly reduces the performance penalty due to TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. A hardware-based TLB coherence scheme can a) improve TLB coherence performance significantly for a large number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the ones receiving it. The effective latency is higher than shown in the graphs because TLB invalidations cost extra cycles to repopulate the TLB, either from the cache or from main memory depending on the case. The latency from TLB shootdowns is higher for applications that use the routine more often. (Refer to graph-2.)&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence Implementation'''&lt;br /&gt;
&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (for example, a typical MOSI protocol with Modified, Owned, Shared and Invalid states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual-to-physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) A TLB entry therefore needs only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, so subsequent memory accesses that depend on it miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure5.PNG]]&lt;br /&gt;
&lt;br /&gt;
Figure-3 shows how the PCAM is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.&lt;br /&gt;
&lt;br /&gt;
In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence so that a translation is tracked by the memory block containing its PTE, rather than by the PTE itself. Maintaining coherence at this coarser granularity (cache block rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can reside in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, the rest of this section refers to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure6.PNG]]&lt;br /&gt;
&lt;br /&gt;
Figure-4 shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added to the PCAM simultaneously with the insertion of their corresponding translations in the TLB. Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 4a). Note that there can be multiple PCAM entries with the same physical address, as in Figure 4a; this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 4b illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
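&lt;br /&gt;
The two PCAM operations described above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design from the paper: the class name UnitdTlb, its method names and the example addresses are assumptions, and the associative search is modelled as a simple scan.&lt;br /&gt;
&lt;br /&gt;
```python
class UnitdTlb:
    # Toy TLB with a parallel PCAM; entry i of each structure belongs together.
    # All names here are hypothetical and chosen for illustration only.
    def __init__(self, size):
        self.translations = [None] * size   # (virtual page, frame) or None
        self.valid = [False] * size         # Valid bit: True means Shared
        self.pcam = [None] * size           # block address holding the PTE

    def insert(self, index, virtual_page, frame, pte_block_address):
        # The PTE address enters the PCAM at the same index as the translation.
        self.translations[index] = (virtual_page, frame)
        self.valid[index] = True
        self.pcam[index] = pte_block_address

    def coherence_invalidate(self, block_address):
        # Every PCAM entry matching the block address clears its Valid bit.
        hits = [i for i, addr in enumerate(self.pcam) if addr == block_address]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits


tlb = UnitdTlb(size=4)
tlb.insert(0, virtual_page=1, frame=30, pte_block_address=12)
tlb.insert(1, virtual_page=2, frame=31, pte_block_address=12)
tlb.insert(2, virtual_page=9, frame=32, pte_block_address=48)
hits = tlb.coherence_invalidate(12)   # two entries share block address 12
```
&lt;br /&gt;
As in Figure 4b, a single invalidation of block address 12 hits both entries whose PTEs live in that block, while the entry for block address 48 stays valid.&lt;br /&gt;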
&lt;br /&gt;
'''Performance'''&lt;br /&gt;
&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. (The ideal-invalidation system uses the same modified OS as UNITD, i.e., with no TLB shootdown support, and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.)&lt;br /&gt;
&lt;br /&gt;
The results show three things. First, UNITD is efficient in ensuring translation coherence, as it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graphs below.)&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure2.PNG]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.PNG]]&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure7.PNG]]&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction/Translation/Data (UNITD) Coherence: One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture: A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59922</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59922"/>
		<updated>2012-03-19T03:43:48Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Unified Cache and TLB coherence solution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (one single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
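&lt;br /&gt;
The addressing arithmetic in the parenthetical above can be checked directly. The snippet below is a small worked example; the constant names are chosen here for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Worked check of the page-table addressing arithmetic.
ENTRY_BITS = 32            # a 4-byte page-table entry
FRAME_SIZE = 4 * 1024      # 4 KB frames, i.e. 2**12 bytes

frames = 2 ** ENTRY_BITS                   # distinct frame numbers an entry can hold
addressable_bytes = frames * FRAME_SIZE    # 2**32 * 2**12 = 2**44 bytes
addressable_tb = addressable_bytes // 2 ** 40   # counting 1 TB as 2**40 bytes

print(addressable_bytes == 2 ** 44, addressable_tb)   # prints: True 16
```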
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (a memory access at the address held in the PTBR, offset by the page number) to find the frame number, and only then access the desired location in that frame. Two memory accesses are needed for every access, so memory access is slowed down by a factor of 2.&lt;br /&gt;
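&lt;br /&gt;
A minimal sketch of this translation, assuming 4 KB pages and a made-up single-level page table (the names PAGE_SIZE, page_table and translate are illustrative, not taken from any real system):&lt;br /&gt;
&lt;br /&gt;
```python
PAGE_SIZE = 4096  # assumed 4 KB pages

# Hypothetical page table: virtual page number -> physical frame number.
page_table = {0: 5, 1: 9, 2: 3}


def translate(virtual_address):
    # Split the address into page number p and page offset d.
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    # Looking up the frame is the extra memory access discussed above.
    frame_number = page_table[page_number]
    # Combine the frame base address with the offset.
    return frame_number * PAGE_SIZE + offset


physical = translate(1 * PAGE_SIZE + 100)   # page 1, offset 100
```
&lt;br /&gt;
Here page 1 maps to frame 9, so the physical address is 9 * 4096 + 100 = 36964.&lt;br /&gt;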
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
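&lt;br /&gt;
The hit and miss behaviour described above can be modelled with a small LRU-managed dictionary. This is a sketch under simplifying assumptions: the class name TLB, its fields and the tiny capacity are illustrative, and a real TLB performs the lookup in parallel hardware rather than in software.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class TLB:
    # Toy fully associative TLB with LRU replacement (illustrative only).
    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table     # page number -> frame number
        self.entries = OrderedDict()     # least recently used entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number):
        if page_number in self.entries:
            self.hits += 1
            self.entries.move_to_end(page_number)    # now most recently used
            return self.entries[page_number]
        self.misses += 1
        frame_number = self.page_table[page_number]  # page-table access on a miss
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)         # evict the LRU entry
        self.entries[page_number] = frame_number
        return frame_number


tlb = TLB(capacity=2, page_table={0: 5, 1: 9, 2: 3})
frames = [tlb.lookup(p) for p in (0, 1, 0, 2, 1)]
```
&lt;br /&gt;
With a capacity of two entries, the access sequence 0, 1, 0, 2, 1 produces one hit (the second access to page 0) and four misses, because page 1 is evicted to make room for page 2.&lt;br /&gt;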
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame is read-only or read-write (the '''protection level'''). Another bit, '''valid-invalid''', indicates whether the page is part of that particular process's address space.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that enables simultaneous look-up of both the L-1 cache and the TLB. If the block is not found in the L-1 cache, the address obtained from the TLB look-up is used to fetch it. Simultaneous look-up of both the L-1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection level changes for a TLB mapping need to be observed by all processors. Finally, a mapping modification for a TLB entry (where the physical mapping changes for a virtual address) also needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review  UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB and cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) store various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
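&lt;br /&gt;
The six steps above can be walked through in a sequential Python simulation. This is a sketch only: real shootdowns run concurrently and rely on inter-processor interrupts, whereas here the interrupt is modelled as a plain loop. The names Processor, PageTable and shootdown are hypothetical, re-enabling interrupts at the end is an assumption, and victims are assumed to leave the invalidated entry empty and refill it on their next TLB miss.&lt;br /&gt;
&lt;br /&gt;
```python
class Processor:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}                    # virtual page -> frame
        self.tlb_active = True           # the TLB active flag
        self.interrupts_enabled = True


class PageTable:
    def __init__(self, mapping):
        self.mapping = dict(mapping)     # virtual page -> frame
        self.locked = False


def shootdown(initiator, victims, page_table, page, new_frame):
    # Step 1: disable local interrupts and clear the active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: lock the page table (modelled as a flag).
    page_table.locked = True
    # Steps 3 and 4: deliver the interrupt; each victim invalidates the
    # affected entry, clears its active flag and waits.
    for cpu in victims:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(page, None)
        cpu.tlb_active = False
    # Step 5: update the page table and the local TLB, set the active
    # flag, then release the lock.
    page_table.mapping[page] = new_frame
    initiator.tlb[page] = new_frame
    initiator.tlb_active = True
    page_table.locked = False
    initiator.interrupts_enabled = True   # assumption: restored here
    # Step 6: victims leave the wait state and resume using their TLBs.
    for cpu in victims:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True


table = PageTable({7: 40})
cpus = [Processor(i) for i in range(3)]
for cpu in cpus:
    cpu.tlb[7] = 40                       # every processor caches page 7
shootdown(cpus[0], cpus[1:], table, page=7, new_frame=41)
```
&lt;br /&gt;
After the call, the page table and the initiating processor see frame 41 for page 7, and no other processor still holds the stale mapping to frame 40.&lt;br /&gt;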
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors assigns each TLB entry a unique identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, forcing a subsequent TLB invalidation. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used on AMD64 hardware, for example by Xen with AMD-V. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
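&lt;br /&gt;
The ASID bookkeeping described above can be sketched as follows. This is a simplified model, not the IRIX implementation: the class names Cpu and Kernel and their methods are hypothetical, TLB entries are keyed by (ASID, page), and ASID exhaustion is ignored.&lt;br /&gt;
&lt;br /&gt;
```python
import itertools


class Cpu:
    def __init__(self):
        self.tlb = {}                         # (asid, page) -> frame
        self.next_asid = itertools.count(1)   # 0 is reserved to mean stale


class Kernel:
    def __init__(self, cpu_count):
        self.cpus = [Cpu() for _ in range(cpu_count)]
        self.asids = {}                       # (pid, cpu index) -> ASID

    def asid_for(self, pid, cpu_index):
        # Hand out a fresh ASID when the process has none, or a zeroed one.
        key = (pid, cpu_index)
        if self.asids.get(key, 0) == 0:
            self.asids[key] = next(self.cpus[cpu_index].next_asid)
        return self.asids[key]

    def pte_modified(self, pid, modifying_cpu):
        # Zero the ASIDs of the process on all other processors.
        # The TLB contents themselves are not touched.
        for (p, cpu_index) in list(self.asids):
            if p == pid and cpu_index != modifying_cpu:
                self.asids[(p, cpu_index)] = 0


kernel = Kernel(cpu_count=2)
old = kernel.asid_for(pid=42, cpu_index=1)
kernel.cpus[1].tlb[(old, 7)] = 40             # translation cached on CPU 1
kernel.pte_modified(pid=42, modifying_cpu=0)  # PTE changed from CPU 0
new = kernel.asid_for(pid=42, cpu_index=1)    # fresh ASID on next use
stale_hit = (new, 7) in kernel.cpus[1].tlb
```
&lt;br /&gt;
The stale entry is still physically present in the TLB of CPU 1 under the old ASID, but lookups under the new ASID can no longer match it.&lt;br /&gt;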
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
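&lt;br /&gt;
A minimal sketch of the generation-count check, assuming the count is stored alongside each page-table entry and compared on every access. The class names Memory and Cpu and their methods are hypothetical, and the count is keyed by virtual page here purely to keep the example short.&lt;br /&gt;
&lt;br /&gt;
```python
class Memory:
    def __init__(self):
        self.page_table = {}     # page -> (frame, generation)

    def map_page(self, page, frame):
        # An unsafe change: bump the generation count.
        _, generation = self.page_table.get(page, (None, 0))
        self.page_table[page] = (frame, generation + 1)

    def access(self, page, cached_generation):
        # Deny the request when the requester's count is stale.
        _, generation = self.page_table[page]
        return cached_generation == generation


class Cpu:
    def __init__(self, memory):
        self.memory = memory
        self.tlb = {}            # page -> (frame, generation)
        self.refreshes = 0

    def read(self, page):
        if page not in self.tlb:
            self.tlb[page] = self.memory.page_table[page]
        frame, generation = self.tlb[page]
        if not self.memory.access(page, generation):
            # Request denied: refresh the stale TLB entry and retry.
            self.tlb[page] = self.memory.page_table[page]
            self.refreshes += 1
            frame, generation = self.tlb[page]
        return frame


memory = Memory()
memory.map_page(7, 40)
cpu = Cpu(memory)
first = cpu.read(7)        # fills the TLB with frame 40
memory.map_page(7, 41)     # remap without any interrupt
second = cpu.read(7)       # stale count detected, entry refreshed
```
&lt;br /&gt;
No interrupt or synchronization is needed at the time of the remapping; the stale entry is only detected and refreshed when the processor next uses it.&lt;br /&gt;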
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section, we discuss the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy. It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into an existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without needing any change to the existing cache coherence protocol. The protocol is an improvement over the software-based shootdown approach to TLB coherence and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are all-hardware based because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. A hardware-based TLB coherence protocol can a) improve TLB coherence performance significantly for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly affect user-visible performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty of a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the processors receiving it. The effective latency is higher than the graphs show, because TLB invalidations cost extra cycles to repopulate the TLB, either from the cache or from main memory depending on the case. The latency from TLB shootdowns is higher for applications that use the routine more often (refer to graph-2).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence Implementation'''&lt;br /&gt;
&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (such as a typical MOSI protocol, with Modified, Owned, Shared and Invalid coherence states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual-to-physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) There are therefore only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating the translation. The translation is then Invalid, so subsequent memory accesses that depend on it will miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:Figure3.JPG]]&lt;br /&gt;
&lt;br /&gt;
Figure-3 shows how the PCAM (a content-addressable memory that holds the physical addresses of the PTEs backing the cached translations) is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.&lt;br /&gt;
&lt;br /&gt;
In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence between a translation and its PTE: each translation is associated with the memory block containing the PTE, rather than with the PTE itself. Maintaining translation coherence at a coarser granularity (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, the rest of this section refers to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
Figure-4 shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB. Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 4a). Note that there can be multiple PCAM entries with the same physical address, as in Figure 4a; this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 4b illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
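&lt;br /&gt;
The insertion and invalidation operations described above can be sketched in software as follows. This is only an illustrative model, not the hardware design from the paper: the class name UnitdTlb, the 64-byte block size and the 8-byte PTE size are assumptions, and the loop merely stands in for the parallel compare that a real CAM performs.&lt;br /&gt;

```python
BLOCK_SIZE = 64   # assumed cache block size in bytes
PTE_SIZE = 8      # assumed PTE size in bytes


class UnitdTlb:
    """Toy TLB paired with a PCAM holding the block address of each entry's PTE."""

    def __init__(self, size):
        self.valid = [False] * size          # Valid bit: True = Shared, False = Invalid
        self.translation = [None] * size     # (virtual page, frame)
        self.pcam = [None] * size            # PTE block address, same index as the TLB

    def insert(self, index, virtual_page, frame, pte_address):
        self.translation[index] = (virtual_page, frame)
        self.pcam[index] = pte_address // BLOCK_SIZE   # block granularity, not PTE
        self.valid[index] = True

    def coherence_invalidate(self, block_address):
        # A CAM compares every entry at once; the loop only models that search.
        hits = [i for i, tag in enumerate(self.pcam)
                if self.valid[i] and tag == block_address]
        for i in hits:
            self.valid[i] = False
        return hits


tlb = UnitdTlb(size=4)
tlb.insert(0, virtual_page=1, frame=50, pte_address=12 * BLOCK_SIZE)
tlb.insert(2, virtual_page=2, frame=51, pte_address=12 * BLOCK_SIZE + PTE_SIZE)
tlb.insert(3, virtual_page=9, frame=52, pte_address=40 * BLOCK_SIZE)
print(tlb.coherence_invalidate(12))    # [0, 2]
print(tlb.valid)                       # [False, False, False, True]
```

Two translations whose PTEs share block 12 are both invalidated by a single coherence request, which mirrors the situation in Figure 4b.&lt;br /&gt;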
&lt;br /&gt;
'''Performance'''&lt;br /&gt;
&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. (The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.)&lt;br /&gt;
&lt;br /&gt;
UNITD is efficient in ensuring translation coherence: it performs as well as the system with ideal TLB invalidations. Its speedups increase with the number of TLB shootdowns and with the number of cores. As expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graphs below.)&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction/Translation/Data (UNITD) Coherence: One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture: A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59921</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59921"/>
		<updated>2012-03-19T03:43:03Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Unified Cache and TLB coherence solution */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct, managed by the operating system, that is kept in main memory. For a process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
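&lt;br /&gt;
The arithmetic in the parenthetical above can be checked with a few lines of Python. The constant names are our own; the sizes are the ones stated in the text.&lt;br /&gt;

```python
# Sizes taken from the paragraph above: 4-byte page-table entries and 4 KB frames.
ENTRY_BITS = 4 * 8            # a 4-byte entry holds a 32-bit frame number
FRAME_SIZE = 4 * 1024         # 4 KB = 2**12 bytes per frame

addressable_frames = 2 ** ENTRY_BITS                  # 2**32 distinct frames
addressable_bytes = addressable_frames * FRAME_SIZE   # 2**32 * 2**12

print(addressable_bytes == 2 ** 44)                   # True
print(addressable_bytes // 2 ** 40, "TB")             # 16 TB
```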
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (one memory access, located through the PTBR) to find the frame number, and only then access the desired location in that frame (a second memory access). Memory access is thus slowed down by a factor of 2.&lt;br /&gt;
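&lt;br /&gt;
A minimal sketch of this translation, assuming 4 KB pages and a small made-up page table (the dictionary contents and the function name translate are hypothetical):&lt;br /&gt;

```python
PAGE_SIZE = 4096  # assumed 4 KB pages, so the offset d is the low 12 bits

# Hypothetical single-level page table: key = page number p, value = frame number.
page_table = {0: 5, 1: 9, 2: 3}


def translate(virtual_address):
    """Split the address into (p, d), look p up, and rebuild the physical address."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    frame_number = page_table[page_number]      # this lookup is the extra memory access
    return frame_number * PAGE_SIZE + offset


physical = translate(1 * PAGE_SIZE + 0x2A)      # page 1, offset 0x2A
print(hex(physical))                            # 0x902a
```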
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
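&lt;br /&gt;
The hit and miss behaviour described above can be modelled with a small software TLB. This is a sketch under assumptions: the class name TLB and its fields are our own, LRU is chosen as the replacement policy, and the dictionary lookup stands in for the parallel tag comparison done in hardware.&lt;br /&gt;

```python
from collections import OrderedDict


class TLB:
    """Toy fully associative TLB with LRU replacement (illustrative only)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table          # stands in for the in-memory page table
        self.entries = OrderedDict()          # page number -> frame number, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number):
        if page_number in self.entries:
            self.hits += 1
            self.entries.move_to_end(page_number)     # mark as most recently used
            return self.entries[page_number]
        # TLB miss: consult the page table, then cache the translation.
        self.misses += 1
        frame_number = self.page_table[page_number]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)          # evict the least recently used entry
        self.entries[page_number] = frame_number
        return frame_number


tlb = TLB(capacity=2, page_table={0: 5, 1: 9, 2: 3})
for page in [0, 1, 0, 2, 1]:
    tlb.lookup(page)
print(tlb.hits, tlb.misses)    # 1 4
```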
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
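&lt;br /&gt;
As a sketch, a two-level walk can be written as below. The page size, the number of entries per table and the table contents are all assumptions chosen for illustration.&lt;br /&gt;

```python
PAGE_SIZE = 4096           # assumed 4 KB pages
ENTRIES_PER_TABLE = 1024   # assumed number of entries in each table

# Hypothetical tables: first-level index 7 leads to a second-level table
# in which index 3 maps to physical frame 42.
first_level = {7: {3: 42}}


def walk(virtual_address):
    """Translate an address by walking both levels of the page table."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    top_index, low_index = divmod(page_number, ENTRIES_PER_TABLE)
    second_level = first_level[top_index]       # first memory access
    frame_number = second_level[low_index]      # second memory access
    return frame_number * PAGE_SIZE + offset


vaddr = (7 * ENTRIES_PER_TABLE + 3) * PAGE_SIZE + 100
print(walk(vaddr) == 42 * PAGE_SIZE + 100)      # True
```

Each extra level adds one memory access per translation, which makes the TLB even more valuable with multilevel tables.&lt;br /&gt;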
&lt;br /&gt;
It may be noted that L-1 caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that enables simultaneous look-up of both the L-1 cache and the TLB. The cache set is selected using bits of the virtual address while the TLB translates the page number in parallel; the frame number returned by the TLB is then compared with the physical tag of the selected cache line. Performing the two look-ups simultaneously makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection-level changes for a mapping must be observed by all processors. A mapping modification (where the physical page mapped to a virtual address changes) must likewise be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review  UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB and cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
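&lt;br /&gt;
The safe/unsafe distinction drawn in this section can be summarised as a small lookup. The string labels and the function name are hypothetical; the classification itself follows the text above.&lt;br /&gt;

```python
# Change categories named in the text; the string labels are our own.
SAFE_CHANGES = {
    "reduce_protection",         # e.g. read-only to read/write
    "set_presence_bit",          # page becomes resident in main memory
    "update_reference_history",
}
UNSAFE_CHANGES = {
    "increase_protection",       # e.g. read/write to read-only
    "change_mapping",            # page now maps to a different frame
}


def needs_tlb_coherence(change):
    """Return True when stale TLB copies must be invalidated or updated."""
    if change in UNSAFE_CHANGES:
        return True
    if change in SAFE_CHANGES:
        return False
    raise ValueError("unknown change: " + change)


print(needs_tlb_coherence("change_mapping"))      # True
print(needs_tlb_coherence("reduce_protection"))   # False
```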
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
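&lt;br /&gt;
The six steps above can be traced with a simplified, single-threaded simulation. This is only a sketch: the Processor class and shootdown function are hypothetical, real inter-processor interrupts and locks are replaced by ordinary function calls and a log, and recipients are assumed to refill their TLB on the next miss rather than eagerly in step 6.&lt;br /&gt;

```python
class Processor:
    """Toy processor with a private TLB and the flags used by the protocol."""

    def __init__(self, name, tlb):
        self.name = name
        self.tlb = dict(tlb)              # virtual page -> frame
        self.tlb_active = True            # the TLB active flag from step 1
        self.interrupts_enabled = True


def shootdown(initiator, others, page_table, page, new_frame):
    log = []
    # Steps 1-2: the initiator disables interrupts, clears its flag, locks the table.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    log.append("lock")
    # Steps 3-4: each recipient services the IPI, invalidates the entry and waits.
    for cpu in others:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(page, None)
        cpu.tlb_active = False
        log.append("invalidate " + cpu.name)
    # Step 5: the initiator updates the page table and its own TLB, then unlocks.
    page_table[page] = new_frame
    initiator.tlb[page] = new_frame
    initiator.tlb_active = True
    initiator.interrupts_enabled = True
    log.append("unlock")
    # Step 6: recipients leave the wait state (assumed to refill on the next miss).
    for cpu in others:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True
    return log


page_table = {0x10: 7}
cpu0 = Processor("cpu0", page_table)
cpu1 = Processor("cpu1", page_table)
cpu2 = Processor("cpu2", {})
print(shootdown(cpu0, [cpu1, cpu2], page_table, 0x10, 8))
print(cpu1.tlb)    # {}
```

Note that every recipient is interrupted even if, like cpu2 here, it never cached the translation; this is one source of the overhead discussed later.&lt;br /&gt;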
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors assigns each TLB entry a unique identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, p. 440,441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so the stale entries are never hit and are effectively invalidated. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used on AMD64 processors with AMD-V (SVM), for example by Xen. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
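&lt;br /&gt;
The bookkeeping described above can be sketched as follows. The AsidTable class and its method names are hypothetical, and the sketch ignores the fact that real ASIDs are a small, finite field that eventually wraps around.&lt;br /&gt;

```python
import itertools


class AsidTable:
    """Toy per-processor ASID bookkeeping; 0 means the process needs a fresh ASID."""

    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        self.asid = {}                                   # (pid, cpu) -> ASID
        self.next_asid = [itertools.count(1) for _ in range(num_cpus)]

    def on_schedule(self, pid, cpu):
        # Assign a new ASID if the process has none here, or if it was zeroed.
        if self.asid.get((pid, cpu), 0) == 0:
            self.asid[(pid, cpu)] = next(self.next_asid[cpu])
        return self.asid[(pid, cpu)]

    def on_pte_change(self, pid, modifying_cpu):
        # Zero the mapping on every other processor; TLB entries are untouched.
        for cpu in range(self.num_cpus):
            if cpu != modifying_cpu and (pid, cpu) in self.asid:
                self.asid[(pid, cpu)] = 0


table = AsidTable(num_cpus=2)
old = table.on_schedule(pid=100, cpu=1)
table.on_schedule(pid=100, cpu=0)
table.on_pte_change(pid=100, modifying_cpu=0)
new = table.on_schedule(pid=100, cpu=1)     # fresh ASID: stale TLB tags no longer match
print(old, new)                             # 1 2
```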
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
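&lt;br /&gt;
A minimal sketch of the generation-count check, assuming the count is kept alongside the frame number in each page-table entry (the Memory class and its methods are hypothetical):&lt;br /&gt;

```python
class Memory:
    """Toy memory manager that validates TLB entries by generation count."""

    def __init__(self):
        self.page_table = {}     # virtual page -> (frame, generation)

    def map_page(self, page, frame):
        # Changing the mapping is an unsafe change, so the count is bumped.
        _, generation = self.page_table.get(page, (None, 0))
        self.page_table[page] = (frame, generation + 1)

    def access(self, page, tlb_entry):
        # The request is accepted only if the cached generation is current.
        _, current_generation = self.page_table[page]
        return tlb_entry[1] == current_generation


memory = Memory()
memory.map_page(page=4, frame=10)
tlb_entry = memory.page_table[4]           # processor caches (frame, generation)
print(memory.access(4, tlb_entry))         # True
memory.map_page(page=4, frame=11)          # remap: the generation count changes
print(memory.access(4, tlb_entry))         # False, the processor must refresh
tlb_entry = memory.page_table[4]
print(memory.access(4, tlb_entry))         # True
```

No interrupt is sent when the mapping changes; the stale entry is only detected when the processor next uses it.&lt;br /&gt;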
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section we discuss the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy. It serves as an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. The TLBs participate in cache coherence updates without requiring any change to that protocol. UNITD improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. The shootdown protocol used for TLB coherence, on the other hand, is essentially software based. Hardware-based TLB coherence can a) significantly improve TLB coherence performance for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty of a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The effective latency is higher than shown in the graphs because TLB invalidations cause extra cycles to be spent repopulating the TLB, either from the cache or from main memory depending on the case. The latency from TLB shootdowns is higher for applications that use the routine more often. (Refer to graph-2).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence Implementation'''&lt;br /&gt;
&lt;br /&gt;
At a high level, UNITD integrates the TLBs into the existing cache coherence protocol (for example, a typical MOSI protocol with Modified, Owned, Shared, and Invalid states). TLBs are simply additional caches that participate in the coherence protocol, much like read-only instruction caches. UNITD has no impact on the cache coherence protocol and thus does not increase its complexity.&lt;br /&gt;
&lt;br /&gt;
As mentioned before, TLB entries are read-only. (Virtual-to-physical address mappings are never modified in the TLBs themselves but rather in the page table entries.) A TLB entry therefore needs only two coherence states: Shared (read-only) and Invalid. UNITD uses a Valid bit in the TLB to maintain an entry’s coherence state. When a translation is inserted into a TLB, it is marked as Shared. The cached translation can be accessed by the local core as long as it is in the Shared state. The translation remains in this state until the TLB receives a coherence message invalidating it. The translation is then Invalid, so subsequent memory accesses that depend on it miss in the TLB and reacquire the translation from the memory system.&lt;br /&gt;
&lt;br /&gt;
[[File:TLBFigure3.JPG]]&lt;br /&gt;
&lt;br /&gt;
Figure-3 shows how the PCAM (PTE CAM, a content-addressable memory that holds the physical addresses of the PTEs whose translations are cached in the TLB) is integrated into the system, with interfaces to the TLB insertion/eviction mechanism (for inserting/evicting the corresponding PCAM entries), the coherence controller (for receiving coherence invalidations) and the processor core. The PCAM is off the critical path of a memory access; it is not accessed during regular TLB lookups for obtaining translations.&lt;br /&gt;
&lt;br /&gt;
In order to integrate TLB coherence with the existing cache coherence protocol with minimal microarchitectural changes, UNITD relaxes the correspondence so that a translation is tied to the memory block containing the PTE rather than to the PTE itself. Maintaining translation coherence at a coarser grain (i.e., cache block, rather than PTE) trades a small performance penalty for ease of integration. Because multiple PTEs can be placed in the same cache block, the PCAM can hold multiple copies of the same block address. For simplicity, the rest of this section refers to PCAM entries simply as PTE addresses.&lt;br /&gt;
&lt;br /&gt;
Figure-4 shows the two operations associated with the PCAM: (a) inserting an entry into the PCAM and (b) performing a coherence invalidation at the PCAM. PTE addresses are added in the PCAM simultaneously with the insertion of their corresponding translations in the TLB. Because the PCAM has the same structure as the TLB, a PTE address is inserted in the PCAM at the same index as its corresponding translation in the TLB (physical address 12 in Figure 4a). Note that there can be multiple PCAM entries with the same physical address, as in Figure 4a; this situation occurs when multiple cached translations correspond to PTEs residing in the same cache block.&lt;br /&gt;
&lt;br /&gt;
PCAM entries are removed as a result of the replacement of the corresponding translation in the TLB or due to an incoming coherence request for read-write access. If a coherence request hits in the PCAM, the Valid bit for the corresponding TLB entry is cleared. If multiple TLB translations have the same PTE block address, a PCAM lookup on this block address results in the identification of all associated TLB entries. Figure 4b illustrates a coherence invalidation of physical address 12 that hits in two PCAM entries.&lt;br /&gt;
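&lt;br /&gt;
The two PCAM operations can be modelled in software as follows. This is a minimal sketch rather than the hardware design from the paper: the class name TlbWithPcam, the explicit index argument, and the 64-byte block size are assumptions chosen for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical software model of a TLB with a PCAM; sizes and names are illustrative.

BLOCK_SIZE = 64  # assumed cache-block size in bytes


class TlbWithPcam:
    def __init__(self, num_entries):
        self.translation = [None] * num_entries  # (virtual page, frame) or None
        self.valid = [False] * num_entries       # Valid bit: True is Shared, False is Invalid
        self.pcam = [None] * num_entries         # block address of the PTE, same index as the TLB

    def insert(self, index, vpage, frame, pte_addr):
        # The PTE address is recorded at cache-block granularity, at the same index.
        self.translation[index] = (vpage, frame)
        self.valid[index] = True
        self.pcam[index] = pte_addr // BLOCK_SIZE

    def lookup(self, vpage):
        # Regular lookups never touch the PCAM.
        for i, entry in enumerate(self.translation):
            if self.valid[i] and entry[0] == vpage:
                return entry[1]
        return None  # TLB miss

    def coherence_invalidate(self, phys_addr):
        # Clear the Valid bit of every entry whose PTE lies in the invalidated block.
        block = phys_addr // BLOCK_SIZE
        hits = [i for i, b in enumerate(self.pcam) if b == block]
        for i in hits:
            self.valid[i] = False
            self.pcam[i] = None
        return hits
```
&lt;br /&gt;
Because matching is done on the block address, one invalidation can clear several TLB entries at once, including entries whose own PTE was not modified. This is the small performance penalty accepted in exchange for ease of integration.&lt;br /&gt;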
&lt;br /&gt;
'''Performance'''&lt;br /&gt;
&lt;br /&gt;
The performance of the UNITD protocol is compared to a) a baseline system that relies on TLB shootdowns and b) a system with ideal (zero-latency) translation invalidations. The ideal-invalidation system uses the same modified OS as UNITD (i.e., with no TLB shootdown support) and verifies that a translation is coherent whenever it is accessed in the TLB. The validation is done in the background and has no performance impact.&lt;br /&gt;
&lt;br /&gt;
Three results stand out. First, UNITD is efficient in ensuring translation coherence: it performs as well as the system with ideal TLB invalidations. Second, UNITD speedups increase with the number of TLB shootdowns and with the number of cores. Third, as expected, UNITD has no impact on performance in the absence of TLB shootdowns. (Refer to the graphs below.)&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture: A Hardware/Software Approach, David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units, Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:TLBFigure7.png&amp;diff=59920</id>
		<title>File:TLBFigure7.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:TLBFigure7.png&amp;diff=59920"/>
		<updated>2012-03-19T03:40:10Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: single_unmap benchmark. UNITD speedup over baseline system for directory.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;single_unmap benchmark. UNITD speedup over baseline system for directory.&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:TLBFigure6.png&amp;diff=59919</id>
		<title>File:TLBFigure6.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:TLBFigure6.png&amp;diff=59919"/>
		<updated>2012-03-19T03:38:08Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: PCAM Operations&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;PCAM Operations&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Figure5.png&amp;diff=59917</id>
		<title>File:Figure5.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Figure5.png&amp;diff=59917"/>
		<updated>2012-03-19T03:35:27Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: PCAM Integration with core and coherence controller&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;PCAM Integration with core and coherence controller&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:TLBFigure3.png&amp;diff=59916</id>
		<title>File:TLBFigure3.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:TLBFigure3.png&amp;diff=59916"/>
		<updated>2012-03-19T03:33:26Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: Shootdown Performance Overhead on Phoenix Benchmarks&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Shootdown Performance Overhead on Phoenix Benchmarks&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:TLBFigure1.png&amp;diff=59915</id>
		<title>File:TLBFigure1.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:TLBFigure1.png&amp;diff=59915"/>
		<updated>2012-03-19T03:30:51Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: Per-shootdown Latency&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Per-shootdown Latency&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59849</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59849"/>
		<updated>2012-03-19T01:59:45Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* TLB Coherence Approaches */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, relying on the hardware and the operating system to bring missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to its page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), a system with 4-byte entries can address 2^44 bytes (16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the address held in the PTBR), obtain the frame number, and only then access the desired location. Memory access is thus slowed down by a factor of 2. &lt;br /&gt;
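&lt;br /&gt;
The split into page number and offset can be illustrated with a short Python sketch. The 4 KB page size and the tiny page_table dictionary are assumptions made for the example; real hardware extracts the two fields with bit operations rather than division.&lt;br /&gt;
&lt;br /&gt;
```python
PAGE_SIZE = 4096  # assumed 4 KB pages


def translate(virtual_address, page_table):
    # Split the address into page number p and page offset d.
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    # Looking up the frame costs one extra memory access.
    frame_number = page_table[page_number]
    # Combine the frame base address with the offset.
    return frame_number * PAGE_SIZE + offset


page_table = {0: 5, 1: 9}               # illustrative mapping: virtual page to frame
physical = translate(4100, page_table)  # virtual page 1, offset 4
```
&lt;br /&gt;
Here virtual address 4100 falls in page 1 at offset 4, so it maps to frame 9 at the same offset.&lt;br /&gt;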
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024. &lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory. &lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
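&lt;br /&gt;
The hit, miss and replacement behaviour described above can be sketched as a small software model. The class name Tlb, the hit and miss counters, and the choice of LRU replacement are assumptions for illustration; a real TLB performs the comparison in parallel hardware.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class Tlb:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # page number to frame number, least recently used first
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number, page_table):
        if page_number in self.entries:
            # TLB hit: the frame number is available immediately.
            self.hits += 1
            self.entries.move_to_end(page_number)  # mark as most recently used
            return self.entries[page_number]
        # TLB miss: a memory reference to the page table must be made.
        self.misses += 1
        frame_number = page_table[page_number]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)       # evict the least recently used entry
        self.entries[page_number] = frame_number
        return frame_number
```
&lt;br /&gt;
With a capacity of two, looking up pages 0, 1, 0 and 2 in that order produces one hit and three misses, and page 1 is the entry that gets replaced.&lt;br /&gt;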
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
Caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch it. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Likewise, protection-level changes for a mapping must be observed by all processors, and a mapping modification (where the physical page for a virtual address changes) must be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. Its variants depend on two factors: how many processors participate in the TLB update (the victim list), and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure ([http://en.wikipedia.org/wiki/Interrupt_descriptor_table IDT]) that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
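&lt;br /&gt;
The six steps above can be walked through with a single-threaded Python sketch. It is only a model of the ordering of events: there are no real interrupts or locks, the names Cpu, shootdown and log are hypothetical, and victims are assumed to reload the new translation lazily on their next TLB miss rather than eagerly in step 6.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical, single-threaded walk-through of the six steps; no real interrupts or locks.

class Cpu:
    def __init__(self, name):
        self.name = name
        self.tlb = {}                   # virtual page to frame
        self.tlb_active = True          # flag: is this processor using its TLB?
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, vpage, new_frame, log):
    # Step 1: the initiator disables interrupts and clears its active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: the initiator locks the page table (modelled as a log entry only).
    log.append('lock')
    # Steps 3 and 4: an IPI reaches each victim, which invalidates the entry and waits.
    for cpu in victims:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(vpage, None)
        cpu.tlb_active = False
        log.append('ack ' + cpu.name)
    # Step 5: all victims have responded; update the PMT and the initiator's own TLB,
    # set the active flag, then release the lock.
    page_table[vpage] = new_frame
    initiator.tlb[vpage] = new_frame
    initiator.tlb_active = True
    initiator.interrupts_enabled = True
    log.append('unlock')
    # Step 6: victims leave the wait state and resume using their TLBs.
    for cpu in victims:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True
```
&lt;br /&gt;
The sketch makes the cost visible: every victim is interrupted and stalled between its acknowledgement and the unlock, which is why the penalty grows with the number of processors.&lt;br /&gt;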
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction-based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an address space identifier (ASID). The ASID is similar to a process identifier (PID), except that it is not guaranteed to stay the same for the life of a process. Each process is assigned an ASID on each processor, unique on that processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’ ASIDs to 0 on the other processors. Note: this does not modify the ASIDs in the TLB entries, only the array that tracks the ASID of the process on each processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process has a zero ASID value on a processor, it assigns the process a new ASID. The new ASID does not match the stale TLB entries (which still carry the old ASID) on that processor, so those entries are no longer used and the translations are reloaded from the page table. The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are also used on AMD64 processors, for example by the Xen hypervisor with AMD-V (SVM). &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partitioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
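The ASID bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class AsidManager, its method names and the per-processor counter are hypothetical, and the sketch ignores the fact that a real ASID field is small and eventually wraps around.&lt;br /&gt;

```python
class AsidManager:
    """Hypothetical kernel-side table of ASIDs, one per (process, processor)."""

    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        self.next_asid = [1] * num_cpus  # 0 is reserved to mean "reassign"
        self.asid_of = {}                # (pid, cpu) to ASID

    def asid_for(self, pid, cpu):
        # Called when pid is scheduled on cpu: a zero (or missing) ASID
        # is replaced by a fresh one that matches no stale TLB entry.
        asid = self.asid_of.get((pid, cpu), 0)
        if asid == 0:
            asid = self.next_asid[cpu]
            self.next_asid[cpu] += 1
            self.asid_of[(pid, cpu)] = asid
        return asid

    def pte_modified(self, pid, initiating_cpu):
        # Zero the ASIDs of pid on all other processors. The TLB entries
        # themselves are not touched, only this array.
        for cpu in range(self.num_cpus):
            if cpu != initiating_cpu and (pid, cpu) in self.asid_of:
                self.asid_of[(pid, cpu)] = 0


manager = AsidManager(num_cpus=2)
old_local = manager.asid_for(pid=7, cpu=0)
old_remote = manager.asid_for(pid=7, cpu=1)
manager.pte_modified(pid=7, initiating_cpu=0)
new_local = manager.asid_for(pid=7, cpu=0)    # unchanged on the initiator
new_remote = manager.asid_for(pid=7, cpu=1)   # fresh ASID on the other CPU
```

In this sketch the initiating processor keeps its ASID, while the other processor receives a new one the next time the process runs there, so its old TLB entries can no longer match.&lt;br /&gt;
&lt;br /&gt;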
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
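A minimal Python sketch of the generation-count check is given below. The classes MemoryManager and Cpu, and the choice to bump the count of both the old and the new frame on a remapping, are assumptions made for illustration. In the scheme described above the check is made when memory is accessed, which is modelled here by the method access_allowed.&lt;br /&gt;

```python
class MemoryManager:
    """Hypothetical memory manager keeping one generation count per frame."""

    def __init__(self):
        self.page_table = {}  # virtual page to frame
        self.generation = {}  # frame to generation count

    def _bump(self, frame):
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def map_page(self, page, frame):
        # A mapping change modifies the count of every frame involved.
        old_frame = self.page_table.get(page)
        if old_frame is not None:
            self._bump(old_frame)
        self.page_table[page] = frame
        self._bump(frame)

    def access_allowed(self, frame, generation):
        # The request is denied when the presented count is stale.
        return self.generation.get(frame) == generation


class Cpu:
    def __init__(self, memory):
        self.memory = memory
        self.tlb = {}       # virtual page to (frame, generation)
        self.refreshes = 0  # number of stale entries that were reloaded

    def _load(self, page):
        frame = self.memory.page_table[page]
        self.tlb[page] = (frame, self.memory.generation[frame])

    def read(self, page):
        if page not in self.tlb:
            self._load(page)
        frame, generation = self.tlb[page]
        if not self.memory.access_allowed(frame, generation):
            # Denied: update the TLB entry and retry with the new mapping.
            self.refreshes += 1
            self._load(page)
            frame, generation = self.tlb[page]
        return frame


memory = MemoryManager()
cpu = Cpu(memory)
memory.map_page(0, 5)
first = cpu.read(0)    # loads the entry for frame 5
memory.map_page(0, 8)  # remap without interrupting the CPU
second = cpu.read(0)   # stale entry is detected and refreshed
```

No interrupt is sent when the mapping changes; the stale entry is detected only when the processor next uses it.&lt;br /&gt;
&lt;br /&gt;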
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy. It is an example of a hardware-based unified protocol for cache and TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. The TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. The protocol improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software-based. Hardware-based TLB coherence can a) significantly improve TLB coherence performance for a large number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly affect application performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty of a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The actual latency is higher than shown in the graphs because TLB invalidations cost extra cycles to repopulate the TLB, from either the cache or main memory depending on the case. The latency from TLB shootdowns is also higher for applications that use the shootdown routine more often (refer to graph-2).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence'''&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture - A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59847</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59847"/>
		<updated>2012-03-19T01:58:01Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* TLB Coherence Approaches */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For a process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), a system with 4-byte entries can address 2^32 x 2^12 = 2^44 bytes (16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (located through the PTBR, which requires a memory access), find the frame number in the page table entry, and only then access the desired location in that frame. Two memory accesses are therefore needed for each reference, so memory access is slowed down by a factor of 2.&lt;br /&gt;
&lt;br /&gt;
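As a minimal sketch of the translation just described, the following Python fragment splits a virtual address into a page number and an offset and looks up the frame in a page table. The 4 KB page size, the dictionary used as a page table and the names PAGE_SIZE and translate are assumptions made for illustration only; real hardware performs this step in the MMU.&lt;br /&gt;

```python
PAGE_SIZE = 4096  # assumed page/frame size of 4 KB (2**12 bytes)


def translate(virtual_address, page_table):
    """Return the physical address for virtual_address.

    page_table is a hypothetical dict mapping page number to frame number.
    """
    # Page number p indexes the page table; offset d is carried over unchanged.
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    frame_number = page_table[page_number]  # the extra memory access
    return frame_number * PAGE_SIZE + offset


page_table = {0: 5, 1: 9}
physical_address = translate(4100, page_table)  # page 1, offset 4
```

Here address 4100 lies in page 1 at offset 4, so it maps to offset 4 of frame 9.&lt;br /&gt;
&lt;br /&gt;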
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
&lt;br /&gt;
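The hit and miss paths above can be illustrated with a small software model. The sketch below is hypothetical: the class TLB and its fields are invented for illustration, LRU is chosen as one of the replacement policies mentioned above, and a Python dictionary stands in for the parallel tag comparison done by real associative hardware.&lt;br /&gt;

```python
from collections import OrderedDict


class TLB:
    """Hypothetical software model of a small TLB with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # page number to frame number, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number, page_table):
        if page_number in self.entries:
            # TLB hit: the frame number is available immediately.
            self.hits += 1
            self.entries.move_to_end(page_number)
            return self.entries[page_number]
        # TLB miss: a memory reference to the page table is needed.
        self.misses += 1
        frame_number = page_table[page_number]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        self.entries[page_number] = frame_number
        return frame_number


page_table = {0: 5, 1: 9, 2: 7}
tlb = TLB(capacity=2)
for page in (0, 1, 0, 2):
    tlb.lookup(page, page_table)
```

After this access sequence the model has recorded one hit (the second reference to page 0) and three misses, and page 1 has been evicted to make room for page 2.&lt;br /&gt;
&lt;br /&gt;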
A page table typically has an additional bit per entry to indicate whether the frame is read-only or read-write ('''protection level'''). Another bit, '''valid-invalid''', is added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to the entry in the next-level page table and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; scheme that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch it. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be cached simultaneously in multiple TLBs. (This happens because data is shared across processors and processes migrate from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to become invalid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection-level changes for a mapping need to be observed by all processors. A mapping modification (where the physical page mapped to a virtual address changes) also needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for every change (i.e., TLBs may temporarily contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges and e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it is not discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants differ in several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt ([http://en.wikipedia.org/wiki/Inter-processor_interrupt IPI]) request to its interrupt controller. The interrupt includes a data structure that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
&lt;br /&gt;
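The six steps above can be walked through with a single-threaded Python model. This is a sketch under stated assumptions, not an operating system implementation: the class Processor and the function shootdown are hypothetical, the inter-processor interrupt is modelled as a direct method call, interrupt masking and the page-table lock are only marked by comments, and victims reload the new entry lazily on their next TLB miss.&lt;br /&gt;

```python
class Processor:
    """Hypothetical model of one CPU with a private TLB."""

    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}           # virtual page number to frame number
        self.tlb_active = True  # flag: this CPU is currently using its TLB

    def handle_ipi(self, pages):
        # Step 4: invalidate the affected entries, clear the flag and wait.
        for page in pages:
            self.tlb.pop(page, None)
        self.tlb_active = False


def shootdown(initiator, victims, page_table, page, new_frame):
    # Step 1: the initiator stops using its own TLB (interrupt masking omitted).
    initiator.tlb_active = False
    # Step 2: a real kernel would lock the page table (or part of it) here.
    # Step 3: deliver the IPI to every victim and wait until all have responded.
    for cpu in victims:
        cpu.handle_ipi([page])
    # Step 5: change the page table, refresh the local TLB and resume.
    page_table[page] = new_frame
    initiator.tlb[page] = new_frame
    initiator.tlb_active = True
    # Step 6: victims leave their wait state and use their TLBs again.
    for cpu in victims:
        cpu.tlb_active = True


page_table = {3: 10}
cpus = [Processor(cpu_id) for cpu_id in range(3)]
for cpu in cpus:
    cpu.tlb[3] = 10  # every CPU has cached the old mapping
shootdown(cpus[0], cpus[1:], page_table, page=3, new_frame=42)
```

After the call, no victim holds the stale mapping of page 3, and the initiator and the page table both hold the new frame. The cost of the approach is visible in step 3: the initiator has to wait for every victim, which is why the penalty grows with the number of processors.&lt;br /&gt;
&lt;br /&gt;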
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction-based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an address space identifier (ASID). The ASID is similar to a process identifier (PID), except that it is not guaranteed to stay the same for the life of a process. Each process is assigned an ASID on each processor, unique on that processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’ ASIDs to 0 on the other processors. Note: this does not modify the ASIDs in the TLB entries, only the array that tracks the ASID of the process on each processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process has a zero ASID value on a processor, it assigns the process a new ASID. The new ASID does not match the stale TLB entries (which still carry the old ASID) on that processor, so those entries are no longer used and the translations are reloaded from the page table. The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are also used on AMD64 processors, for example by the Xen hypervisor with AMD-V (SVM). &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partitioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin, and Bracy. It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. It improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) significantly improve TLB coherence performance for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly affect user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the ones receiving it. The actual latency is higher than shown in the graphs because TLB invalidations cause extra cycles to be spent repopulating the TLB, from either the cache or main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often. (Refer to graph-2).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence'''&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture - A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59844</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59844"/>
		<updated>2012-03-19T01:54:50Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* TLB Coherence Approaches */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For a process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
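&lt;br /&gt;
The arithmetic in the parenthetical above can be checked with a short Python sketch. The constant names are illustrative, and the sketch assumes, as the text does, that all 32 bits of an entry hold a frame number.&lt;br /&gt;
&lt;br /&gt;
```python
# Illustrative arithmetic only; the sizes are the ones quoted in the text.
ENTRY_BITS = 32                  # a 4-byte page-table entry
FRAME_SIZE = 4 * 1024            # 4 KB frames, i.e. 2**12 bytes

frames = 2 ** ENTRY_BITS         # number of distinct frame numbers
addressable_bytes = frames * FRAME_SIZE

print(addressable_bytes == 2 ** 44)     # True
print(addressable_bytes // 2 ** 40)     # 16 (terabytes)
```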
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the PTBR), find the frame number in the page-table entry, and only then access the desired location in that frame. Two memory accesses are therefore needed for each reference, so memory access is slowed down by a factor of 2.&lt;br /&gt;
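&lt;br /&gt;
The translation just described can be sketched in Python as follows. This is a minimal illustration, assuming 4 KB pages and a single-level page table held in a dictionary; the function names and the sample mapping are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
PAGE_SIZE = 4096  # assumed 4 KB pages


def split_address(virtual_address):
    """Return (page number p, page offset d) for a virtual address."""
    return divmod(virtual_address, PAGE_SIZE)


def translate(virtual_address, page_table):
    """Map a virtual address to a physical address via a one-level page table."""
    page_number, offset = split_address(virtual_address)
    frame_number = page_table[page_number]   # the extra memory access
    return frame_number * PAGE_SIZE + offset


page_table = {0: 5, 1: 2}                # hypothetical page-to-frame mapping
print(split_address(4100))               # (1, 4)
print(translate(4100, page_table))       # 8196
```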
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
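&lt;br /&gt;
The hit and miss paths above can be illustrated with a small software model. This is only a sketch: a real TLB is hardware, and the class name, the capacity of two entries, and the LRU policy chosen here are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class TLB:
    """A tiny fully associative TLB with LRU replacement (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # page number to frame number, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number, page_table):
        if page_number in self.entries:
            self.hits += 1
            self.entries.move_to_end(page_number)   # mark as most recently used
            return self.entries[page_number]
        # TLB miss: reference the page table, then cache the translation.
        self.misses += 1
        frame_number = page_table[page_number]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)        # evict least recently used
        self.entries[page_number] = frame_number
        return frame_number


page_table = {0: 7, 1: 3, 2: 9}
tlb = TLB(capacity=2)
for page in [0, 1, 0, 2, 1]:
    tlb.lookup(page, page_table)
print(tlb.hits, tlb.misses)    # 1 4
print(list(tlb.entries))       # [2, 1]
```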
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; scheme that enables simultaneous look-up of both the L-1 cache and the TLB. If the page block is not found in the L-1 cache, a TLB look-up of the address is used to refresh the page block. Simultaneous look-up of both the L-1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another). Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table). If a TLB has a mapping from a page block that falls within an invalidated Page Table Entry (PTE), then the mapping needs to be flushed from all TLBs and should not be used. Further, protection-level changes for a TLB mapping need to be observed by all processors. A mapping modification for a TLB entry (where the physical mapping changes for a virtual address) also needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for every datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
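&lt;br /&gt;
The safe/unsafe distinction discussed above can be summarized as a small predicate. This is a sketch under simplifying assumptions: the Entry tuple is a hypothetical, reduced PMT entry with a single write-permission flag, and dirty and reference bits are ignored.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import namedtuple

# Hypothetical, simplified PMT entry: frame number, write permission, presence bit.
Entry = namedtuple("Entry", ["frame", "writable", "present"])


def is_safe_change(old, new):
    """Return True if TLBs may keep a cached copy of `old` after the update."""
    if not old.present:
        return True                    # nothing valid could have been cached
    if not new.present:
        return False                   # page removed from main memory
    if old.frame != new.frame:
        return False                   # page-to-frame mapping changed
    if old.writable and not new.writable:
        return False                   # protection increased
    return True                        # e.g. read-only became read/write


read_only = Entry(frame=3, writable=False, present=True)
read_write = Entry(frame=3, writable=True, present=True)
print(is_safe_change(read_only, read_write))                   # True
print(is_safe_change(read_write, read_only))                   # False
print(is_safe_change(read_only, read_only._replace(frame=4)))  # False
```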
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 Virtually indexed caches] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants differ in several respects: how many processors participate in the TLB update (the victim list), and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt (IPI) request to its interrupt controller. The interrupt includes a data structure that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
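&lt;br /&gt;
The six steps above can be replayed in a simplified, sequential Python model. This is a sketch only: there is no real concurrency, the page-table lock is represented by log messages, and the Processor class and shootdown function are hypothetical names introduced for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
class Processor:
    """One CPU with a private TLB (illustrative model, not real kernel code)."""

    def __init__(self, name):
        self.name = name
        self.tlb = {}                  # virtual page number to frame number
        self.tlb_active = True
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, page, new_frame):
    """Sequentially replay the six shootdown steps for one page."""
    log = []
    # Step 1: the initiator disables interrupts and clears its active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: the initiator locks the page table.
    log.append("lock page table")
    # Steps 3 and 4: each victim receives the IPI, invalidates the entry,
    # clears its active flag, and waits for the lock.
    for cpu in victims:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(page, None)
        cpu.tlb_active = False
        log.append(cpu.name + " invalidated page " + str(page))
    # Step 5: the initiator updates the page table and its own TLB,
    # sets its active flag, and releases the lock.
    page_table[page] = new_frame
    initiator.tlb[page] = new_frame
    initiator.tlb_active = True
    initiator.interrupts_enabled = True
    log.append("unlock page table")
    # Step 6: victims leave the wait state; they reload the entry on demand.
    for cpu in victims:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True
    return log


page_table = {10: 3}
cpu0, cpu1, cpu2 = Processor("cpu0"), Processor("cpu1"), Processor("cpu2")
for cpu in (cpu0, cpu1, cpu2):
    cpu.tlb[10] = 3                    # every TLB caches the old mapping

log = shootdown(cpu0, [cpu1, cpu2], page_table, page=10, new_frame=8)
print(page_table[10], cpu0.tlb[10])    # 8 8
print(10 in cpu1.tlb, 10 in cpu2.tlb)  # False False
```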
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an address space identifier (ASID). The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that records the process’ ASID on each processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so those entries can no longer be used and are effectively invalidated. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used on AMD64 hardware, for example by Xen with AMD-V. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
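&lt;br /&gt;
The ASID bookkeeping described above can be sketched as follows. This is an illustrative model, not IRIX code: the AsidManager class and its method names are hypothetical, TLB entries are modeled as keys of the form (ASID, page), and ASID exhaustion and wraparound are ignored.&lt;br /&gt;
&lt;br /&gt;
```python
class AsidManager:
    """Per-processor ASID bookkeeping (illustrative; names are hypothetical)."""

    def __init__(self, num_cpus):
        self.next_asid = [1] * num_cpus   # ASID 0 is reserved to mean "stale"
        self.asid_of = {}                 # (pid, cpu) to ASID

    def asid_for(self, pid, cpu):
        """Return the process ASID on this CPU, assigning a new one if it is 0."""
        if self.asid_of.get((pid, cpu), 0) == 0:
            self.asid_of[(pid, cpu)] = self.next_asid[cpu]
            self.next_asid[cpu] += 1
        return self.asid_of[(pid, cpu)]

    def pte_modified(self, pid, modifying_cpu):
        """Zero the process ASID on every other CPU; TLBs are left untouched."""
        for (owner, cpu) in list(self.asid_of):
            if owner == pid and cpu != modifying_cpu:
                self.asid_of[(owner, cpu)] = 0


manager = AsidManager(num_cpus=2)
asid_cpu1 = manager.asid_for(pid=42, cpu=1)
tlb_cpu1 = {(asid_cpu1, 10): 3}          # entry tagged with the old ASID

manager.pte_modified(pid=42, modifying_cpu=0)
new_asid = manager.asid_for(pid=42, cpu=1)
print(asid_cpu1, new_asid)               # 1 2
print((new_asid, 10) in tlb_cpu1)        # False
```
&lt;br /&gt;
Because the lookup key now carries the new ASID, the stale entry on the second processor can never match again, which is the effect the text describes.&lt;br /&gt;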
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
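&lt;br /&gt;
The validation scheme above can be sketched in Python as follows. This is a minimal model under stated assumptions: generation counts are kept per frame in a dictionary, a denied request is represented by an exception, and all class and function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
class StaleTranslation(Exception):
    """Raised when a request carries an out-of-date generation count."""


class MemoryManager:
    def __init__(self, page_table):
        self.page_table = dict(page_table)     # virtual page to frame
        self.generation = {}                   # frame to generation count

    def entry_for(self, page):
        """Build a TLB entry: (frame, generation count at load time)."""
        frame = self.page_table[page]
        return (frame, self.generation.get(frame, 0))

    def restrict_or_remap(self, frame):
        """Record an unsafe change by bumping the frame's generation count."""
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def access(self, tlb_entry):
        """Validate the generation count before granting the access."""
        frame, generation = tlb_entry
        if generation != self.generation.get(frame, 0):
            raise StaleTranslation(frame)
        return frame


def load(tlb, page, manager):
    """Access a page through the TLB, refreshing a stale entry once."""
    if page not in tlb:
        tlb[page] = manager.entry_for(page)
    try:
        return manager.access(tlb[page])
    except StaleTranslation:
        tlb[page] = manager.entry_for(page)    # update the TLB entry, then retry
        return manager.access(tlb[page])


manager = MemoryManager({10: 3})
tlb = {}
print(load(tlb, 10, manager), tlb[10])   # 3 (3, 0)
manager.restrict_or_remap(3)             # no interrupt is sent to any processor
print(load(tlb, 10, manager), tlb[10])   # 3 (3, 1)
```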
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin, and Bracy. It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. It improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. Hardware-based TLB coherence can a) significantly improve TLB coherence performance for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly affect user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the ones receiving it. The actual latency is higher than shown in the graphs because TLB invalidations cause extra cycles to be spent repopulating the TLB, from either the cache or main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often. (Refer to graph-2).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence'''&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture - A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units- Milean Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59843</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59843"/>
		<updated>2012-03-19T01:53:36Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* TLB Coherence Approaches */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
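&lt;br /&gt;
The arithmetic in the parenthetical example can be checked with a few lines of Python. This is only a worked calculation under the values stated in the text (32-bit entries, 4 KB frames); the names ENTRY_BITS, FRAME_SIZE and as_tib are illustrative and not part of any real system.&lt;br /&gt;

```python
# Worked arithmetic for the page-table example (values taken from the text).
ENTRY_BITS = 32                 # a 4-byte page-table entry holds a 32-bit frame number
FRAME_SIZE = 4 * 1024           # 4 KB frames, i.e. 2**12 bytes

num_frames = 2 ** ENTRY_BITS            # distinct frames one entry can name
addressable = num_frames * FRAME_SIZE   # total bytes of physical memory reachable


def as_tib(n_bytes):
    """Convert a byte count to units of 2**40 bytes (the text's 'TB')."""
    return n_bytes // 2 ** 40


print(addressable == 2 ** 44)   # True
print(as_tib(addressable))      # 16
```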
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table using the PTBR (which requires a memory access), obtain the frame number from the page table, and only then access the required location in that frame. Each access therefore needs two memory accesses, so memory access is slowed down by a factor of 2.&lt;br /&gt;
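&lt;br /&gt;
A minimal sketch of this translation step follows, assuming 4 KB pages and a small made-up page table. PAGE_SIZE, page_table and translate are hypothetical names used only for illustration; real hardware does the split with bit fields rather than division.&lt;br /&gt;

```python
PAGE_SIZE = 4096  # assumed 4 KB pages

# Hypothetical page table: key = virtual page number, value = frame number.
page_table = {0: 5, 1: 9, 2: 3}


def translate(vaddr):
    """Split a virtual address into (p, d) and map it through the page table."""
    p, d = divmod(vaddr, PAGE_SIZE)      # page number p, page offset d
    frame = page_table[p]                # the extra memory access described in the text
    return frame * PAGE_SIZE + d         # frame base address combined with the offset


# Address 4100 is page 1, offset 4; page 1 lives in frame 9.
print(translate(4100))   # 36868
```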
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
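&lt;br /&gt;
The hit and miss paths described above can be sketched as a toy model. This assumes LRU replacement (one of the policies mentioned) and a tiny two-entry TLB; the TLB class and its fields are hypothetical and do not model any particular processor.&lt;br /&gt;

```python
from collections import OrderedDict


class TLB:
    """Toy fully associative TLB with LRU replacement (illustrative only)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table      # hypothetical dict: page number to frame number
        self.entries = OrderedDict()      # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        if page in self.entries:          # TLB hit: frame is immediately available
            self.hits += 1
            self.entries.move_to_end(page)
            return self.entries[page]
        self.misses += 1                  # TLB miss: consult the page table in memory
        frame = self.page_table[page]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)   # TLB full: evict the LRU entry
        self.entries[page] = frame        # cache the translation for next time
        return frame


tlb = TLB(capacity=2, page_table={0: 5, 1: 9, 2: 3})
for page in [0, 1, 0, 2, 1]:
    tlb.lookup(page)
print(tlb.hits, tlb.misses)   # 1 4
```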
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; scheme that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch the block. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), then that mapping needs to be flushed from all TLBs and must not be used. Further, protection-level changes for a TLB mapping need to be observed by all processors. Mapping modifications for a TLB entry (where the physical mapping changes for a virtual address) also need to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
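&lt;br /&gt;
The safe/unsafe distinction can be expressed as a small predicate. This is a sketch under the three rules listed above, not an actual kernel routine; the Translation record and the is_unsafe function are hypothetical, and permissions are modelled as a set of letters purely for illustration.&lt;br /&gt;

```python
from collections import namedtuple

# Hypothetical translation record: frame number, set of permissions, valid flag.
Translation = namedtuple("Translation", ["frame", "perms", "valid"])


def is_unsafe(old, new):
    """Return True when a change must be propagated to other TLBs."""
    if not new.valid:
        return True                       # c) translation marked as invalid
    if new.frame != old.frame:
        return True                       # a) mapping modification
    if not new.perms.issuperset(old.perms):
        return True                       # b) some privilege was removed
    return False                          # privileges only grew, or nothing changed


ro = Translation(frame=7, perms=frozenset({"r"}), valid=True)
rw = Translation(frame=7, perms=frozenset({"r", "w"}), valid=True)
print(is_unsafe(ro, rw))   # False: read-only to read-write can be fixed up lazily
print(is_unsafe(rw, ro))   # True: a stale read-write entry would allow illegal stores
```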
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review  UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB and cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#tocsection-16 &amp;quot;Virtually indexed caches&amp;quot;] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt (IPI) request to its interrupt controller. The interrupt includes a data structure that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
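&lt;br /&gt;
The six steps above can be walked through in a sequential sketch. This is a simplification under stated assumptions: real shootdowns run concurrently using locks and inter-processor interrupts, whereas this model only records the order of events. The Processor class, the shootdown function and the log format are hypothetical.&lt;br /&gt;

```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.tlb = {}            # page number to frame number
        self.tlb_active = True   # flag: is this processor using its TLB?


def shootdown(initiator, victims, page_table, page, new_frame, log):
    """Sequential walk through the six steps; real systems run them concurrently."""
    initiator.tlb_active = False
    log.append("1: " + initiator.name + " disables interrupts, clears active flag")
    log.append("2: " + initiator.name + " locks the page table")
    for v in victims:                       # steps 3 and 4: IPI and response
        log.append("3: IPI sent to " + v.name)
        v.tlb.pop(page, None)               # invalidate the affected entry
        v.tlb_active = False                # victim now waits on the lock
        log.append("4: " + v.name + " invalidated page " + str(page))
    page_table[page] = new_frame            # step 5: update the PMT ...
    initiator.tlb[page] = new_frame         # ... and the initiator's own TLB
    initiator.tlb_active = True
    log.append("5: " + initiator.name + " updated PMT, releases the lock")
    for v in victims:                       # step 6: victims leave the wait state
        v.tlb[page] = page_table[page]      # reload the fresh translation
        v.tlb_active = True
        log.append("6: " + v.name + " resumes")


p0, p1, p2 = Processor("P0"), Processor("P1"), Processor("P2")
page_table = {4: 10}
for p in (p0, p1, p2):
    p.tlb[4] = 10                           # all three cache the old mapping
log = []
shootdown(p0, [p1, p2], page_table, page=4, new_frame=22, log=log)
print(p0.tlb[4], p1.tlb[4], p2.tlb[4])      # 22 22 22
```

The number of log entries grows linearly with the number of victims, which mirrors why shootdown cost rises with processor count.&lt;br /&gt;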
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Raj and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors assigns each TLB entry a unique identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, forcing a subsequent TLB invalidation. The IRIX 5.2 operating system uses this mechanism on MIPS, and ASIDs are also used on AMD64 platforms. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
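&lt;br /&gt;
A minimal sketch of the ASID bookkeeping described above, assuming two processors and a simple counter as the ASID allocator. The names asid_of, schedule and pte_modified are hypothetical; a real kernel must also handle ASID exhaustion and wrap-around, which is omitted here.&lt;br /&gt;

```python
import itertools

NUM_CPUS = 2
next_asid = [itertools.count(1) for _ in range(NUM_CPUS)]   # per-CPU ASID allocator
asid_of = {}       # (pid, cpu) to ASID; 0 means the process needs a fresh ASID


def schedule(pid, cpu):
    """Return the ASID to run pid under on cpu, allocating one if it is 0 or missing."""
    if asid_of.get((pid, cpu), 0) == 0:
        asid_of[(pid, cpu)] = next(next_asid[cpu])
    return asid_of[(pid, cpu)]


def pte_modified(pid, on_cpu):
    """Zero the process's ASIDs on every other CPU; TLB entries are left untouched."""
    for cpu in range(NUM_CPUS):
        if cpu != on_cpu:
            asid_of[(pid, cpu)] = 0


old = schedule(42, 1)          # process 42 first runs on CPU 1
pte_modified(42, on_cpu=0)     # its PTE is changed while running on CPU 0
new = schedule(42, 1)          # next time on CPU 1 it receives a fresh ASID
print(old, new)                # 1 2
```

Because the new ASID differs from the one stored in the stale TLB entries, those entries can no longer match lookups made by the process.&lt;br /&gt;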
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
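&lt;br /&gt;
The generation-count check can be sketched as follows. This is an illustrative model only: the generation table, the TLBEntry class and the unsafe_change and access functions are hypothetical names, and the memory-side check is reduced to a simple comparison.&lt;br /&gt;

```python
generation = {}    # frame number to current generation count (kept by the memory manager)


class TLBEntry:
    def __init__(self, page, frame):
        self.page = page
        self.frame = frame
        self.gen = generation.setdefault(frame, 0)   # count seen when the entry was created


def unsafe_change(frame):
    """Called on a remap or a permission restriction: bump the frame's count."""
    generation[frame] = generation.get(frame, 0) + 1


def access(entry):
    """Memory-side check: deny the request if the entry's count is stale."""
    if entry.gen != generation[entry.frame]:
        return "denied: refresh TLB entry"
    return "ok"


entry = TLBEntry(page=3, frame=8)
print(access(entry))    # ok
unsafe_change(8)        # e.g. permissions restricted by another processor
print(access(entry))    # denied: refresh TLB entry
```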
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section the salient features of UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu/Lebeck/Sorin/Bracy in their paper are discussed. This provides an example of hardware based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach for TLB coherence and significantly reduces the performance penalty due to TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. A hardware-based TLB coherence scheme can a) improve TLB coherence performance significantly for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the ones receiving it. The actual latency is higher than shown in the graphs because TLB invalidations cost extra cycles to repopulate the TLB, from either the cache or main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often. (Refer to graph-2).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence'''&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction/Translation/Data (UNITD) Coherence: One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture: A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59840</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59840"/>
		<updated>2012-03-19T01:52:49Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* TLB Coherence Approaches */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical page frames. A page table is a software construct, managed by the operating system, that is kept in main memory. For each process, a pointer to the page table is held in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, a system with 4-byte entries can address 2^44 bytes (16 TB) of physical memory.)&lt;br /&gt;
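&lt;br /&gt;
As a quick check of the arithmetic in the parenthetical example above, the following minimal Python sketch recomputes the addressable memory. The constant names are illustrative only; the sizes are the ones quoted in the text.&lt;br /&gt;
&lt;br /&gt;
```python
# Worked arithmetic for the page-table example (sizes taken from the text).
ENTRY_BITS = 32            # a 4-byte page-table entry
FRAME_SIZE = 4 * 1024      # 4 KB frames, i.e. 2**12 bytes

frame_count = 2 ** ENTRY_BITS                  # frame numbers one entry can name
addressable_bytes = frame_count * FRAME_SIZE   # 2**32 * 2**12 = 2**44
addressable_tib = addressable_bytes // 2 ** 40 # expressed in units of 2**40 bytes

print(frame_count, addressable_bytes, addressable_tib)
```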
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (located through the PTBR), which requires a memory access to obtain the frame number, and only then access the required location in that frame. Two memory accesses are therefore needed for each reference, so memory access is slowed down by a factor of 2.&lt;br /&gt;
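&lt;br /&gt;
The split into page number and offset can be sketched as follows. This is a minimal illustration that assumes 4 KB pages and models the page table as a Python dictionary; the function names and the example table are hypothetical, not part of any real MMU interface.&lt;br /&gt;
&lt;br /&gt;
```python
PAGE_SIZE = 4096  # assumed 4 KB pages, so the offset is the low 12 bits


def split_address(virtual_address):
    """Return (page number p, page offset d) for a virtual address."""
    return divmod(virtual_address, PAGE_SIZE)


def translate(virtual_address, page_table):
    """Look up the frame for page p and recombine it with the offset d."""
    page_number, offset = split_address(virtual_address)
    frame_number = page_table[page_number]   # the extra memory access
    return frame_number * PAGE_SIZE + offset


# Hypothetical page table mapping virtual page numbers to frame numbers.
example_page_table = {0: 5, 1: 2, 2: 7}
physical_address = translate(0x1ABC, example_page_table)
```
&lt;br /&gt;
Here the page-table lookup inside translate stands for the additional memory access that motivates the TLB.&lt;br /&gt;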
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
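&lt;br /&gt;
The hit and miss paths described above can be sketched with a toy model. This is only an illustration: it assumes LRU replacement (one of the policies mentioned), models the page table as a dictionary, and uses class and variable names that are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class TLB:
    """Toy fully associative TLB with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # maps page number to frame number
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number, page_table):
        if page_number in self.entries:
            self.hits += 1
            self.entries.move_to_end(page_number)   # mark as most recently used
            return self.entries[page_number]
        # TLB miss: consult the page table, then cache the translation.
        self.misses += 1
        frame_number = page_table[page_number]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)        # evict least recently used
        self.entries[page_number] = frame_number
        return frame_number


page_table = {0: 5, 1: 2, 2: 7}
tlb = TLB(capacity=2)
frames = [tlb.lookup(page, page_table) for page in (0, 1, 0, 2, 1)]
```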
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the page is read-only or read-write (the '''protection level'''). Another bit, '''valid-invalid''', indicates whether the page is part of that particular process's address space.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of the virtual page to a physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; solution that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address from the TLB look-up is used to fetch the block. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), that mapping needs to be flushed from all TLBs and must not be used. Further, protection-level changes for a TLB mapping need to be observed by all processors. A mapping modification for a TLB entry (where the physical mapping changes for a virtual address) also needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
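&lt;br /&gt;
The safe/unsafe distinction above can be expressed as a small predicate. This is a hedged sketch: the Translation record and the function name are hypothetical, and the first check encodes an added assumption that an invalid translation is never held in any TLB.&lt;br /&gt;
&lt;br /&gt;
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    frame: int
    writable: bool
    valid: bool


def is_unsafe_change(old, new):
    """Return True when remote TLB copies must be made coherent first."""
    if not old.valid:
        return False   # assumption: an invalid translation is never cached
    if not new.valid:
        return True    # c) translation marked invalid
    if old.frame != new.frame:
        return True    # a) mapping modification
    if old.writable and not new.writable:
        return True    # b) privileges decreased
    return False       # d), e): safe; a stale TLB is corrected lazily


read_only = Translation(frame=3, writable=False, valid=True)
read_write = Translation(frame=3, writable=True, valid=True)
```
&lt;br /&gt;
With these definitions, is_unsafe_change(read_write, read_only) is true, while the reverse direction is treated as safe.&lt;br /&gt;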
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/CPU_cache#toclevel-1 &amp;quot;Virtually indexed caches&amp;quot;] may be accessed using the virtual memory reference. This approach eliminates the TLB altogether, along with its associated coherence issue. It requires processors to index data cache values using virtual addresses, thereby avoiding the need for virtual-to-physical address translation between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or an L2 miss, depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt (IPI) request to its interrupt controller. The interrupt includes a data structure that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
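&lt;br /&gt;
The six steps above can be walked through with a single-threaded toy model. This is a sketch under stated assumptions, not a kernel implementation: the interrupt is modelled as a direct method call, the page-table lock is an ordinary threading.Lock, victims drop their entries and reload them on later misses, and all class and function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import threading

page_table_lock = threading.Lock()   # stand-in for the page-table lock


class Processor:
    """Toy processor with a private TLB mapping virtual pages to frames."""

    def __init__(self, name):
        self.name = name
        self.tlb = {}
        self.tlb_active = True

    def handle_shootdown(self, pages):
        # Step 4: invalidate the affected entries and stop using the TLB.
        for page in pages:
            self.tlb.pop(page, None)
        self.tlb_active = False

    def resume(self):
        # Step 6: leave the wait state; dropped entries reload on later misses.
        self.tlb_active = True


def shootdown(initiator, others, page_table, updates):
    """Walk through the six steps sequentially for one batch of PMT updates."""
    initiator.tlb_active = False                  # step 1
    with page_table_lock:                         # step 2
        for cpu in others:                        # steps 3 and 4
            cpu.handle_shootdown(list(updates))
        page_table.update(updates)                # step 5: change the PMT
        for page, frame in updates.items():
            if page in initiator.tlb:
                initiator.tlb[page] = frame       # refresh the local TLB
        initiator.tlb_active = True
    for cpu in others:                            # step 6
        cpu.resume()


page_table = {0: 5, 1: 2}
cpu0, cpu1 = Processor("cpu0"), Processor("cpu1")
cpu0.tlb = dict(page_table)
cpu1.tlb = dict(page_table)
shootdown(cpu0, [cpu1], page_table, {1: 9})
```
&lt;br /&gt;
After the call, the victim no longer holds the stale translation for page 1, while the initiator holds the updated one.&lt;br /&gt;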
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’s ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that tracks each process's ASID per processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, which effectively invalidates them. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also available on AMD64 processors and are used, for example, by Xen with AMD-V. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
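&lt;br /&gt;
The ASID bookkeeping described above can be sketched as follows. This is a minimal illustration, not IRIX code: the class and method names are hypothetical, ASIDs are handed out from a simple per-processor counter, and wraparound of the finite ASID space is ignored.&lt;br /&gt;
&lt;br /&gt;
```python
class AsidAllocator:
    """Per-processor ASID bookkeeping; 0 means no valid ASID is assigned."""

    def __init__(self, num_cpus):
        self.next_asid = [1] * num_cpus
        self.asid_of = {}   # maps (pid, cpu) to the ASID in use there

    def on_pte_change(self, pid, initiating_cpu):
        # Zero the process's ASID on every other processor. TLB entries
        # themselves are untouched; only this mapping array changes.
        for (owner, cpu) in list(self.asid_of):
            if owner == pid and cpu != initiating_cpu:
                self.asid_of[(owner, cpu)] = 0

    def asid_for(self, pid, cpu):
        # Called when the process is scheduled: a zero (or missing) ASID is
        # replaced by a fresh one that cannot match stale TLB entries.
        if self.asid_of.get((pid, cpu), 0) == 0:
            self.asid_of[(pid, cpu)] = self.next_asid[cpu]
            self.next_asid[cpu] += 1
        return self.asid_of[(pid, cpu)]


allocator = AsidAllocator(num_cpus=2)
first_on_cpu0 = allocator.asid_for(pid=42, cpu=0)
first_on_cpu1 = allocator.asid_for(pid=42, cpu=1)
allocator.on_pte_change(pid=42, initiating_cpu=0)
second_on_cpu1 = allocator.asid_for(pid=42, cpu=1)
```
&lt;br /&gt;
The process keeps its ASID on the initiating processor but receives a new one on the other processor, so TLB entries tagged with the old ASID there can no longer match.&lt;br /&gt;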
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
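&lt;br /&gt;
The generation-count check can be sketched as below. This is an illustrative model only: the Memory class, its method names and the tuple used for a TLB entry are assumptions made for the example, and the counts start at zero by convention.&lt;br /&gt;
&lt;br /&gt;
```python
class Memory:
    """Toy memory manager keeping a generation count per frame."""

    def __init__(self):
        self.generation = {}   # maps frame number to its current count

    def map_changed(self, frame):
        # Called on a mapping change or a permission restriction.
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def access(self, frame, cached_generation):
        # Deny the request if the TLB entry was created under an older count.
        return cached_generation == self.generation.get(frame, 0)


memory = Memory()
tlb_entry = (7, memory.generation.get(7, 0))    # (frame, generation at fill)
allowed_before = memory.access(*tlb_entry)
memory.map_changed(7)
allowed_after = memory.access(*tlb_entry)       # stale entry is rejected
refreshed_entry = (7, memory.generation[7])
allowed_refreshed = memory.access(*refreshed_entry)
```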
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section, the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol, proposed by Romanescu/Lebeck/Sorin/Bracy in their paper, are discussed. This provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach for TLB coherence and significantly reduces the performance penalty due to TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are entirely hardware-based because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software-based. Hardware-based TLB coherence can a) significantly improve TLB coherence performance for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the ones receiving it. The actual latency is higher than shown in the graphs because TLB invalidations result in extra cycles spent repopulating the TLB, from either the cache or main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often. (Refer to graph-2.)&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence'''&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture - A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence?&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59825</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59825"/>
		<updated>2012-03-19T01:45:25Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* TLB Coherence Approaches */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space), that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical page frames. A page table is a software construct, managed by the operating system, that is kept in main memory. For each process, a pointer to the page table is held in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, a system with 4-byte entries can address 2^44 bytes (16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (located through the PTBR), which requires a memory access to obtain the frame number, and only then access the required location in that frame. Two memory accesses are therefore needed for each reference, so memory access is slowed down by a factor of 2.&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the page is read-only or read-write (the '''protection level'''). Another bit, '''valid-invalid''', indicates whether the page is part of that particular process's address space.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
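&lt;br /&gt;
A two-level walk can be sketched as below. The sizes (4 KB pages, 1024 entries per table) and the nested dictionaries are assumptions chosen for illustration; real page tables are arrays in physical memory and may have more levels.&lt;br /&gt;

```python
PAGE_SIZE = 2 ** 12          # assumed 4 KB pages
ENTRIES_PER_TABLE = 2 ** 10  # assumed 1024 entries in each table


def walk(virtual_address, top_level_table):
    """Resolve a virtual address through a two-level page table."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    top_index, second_index = divmod(page_number, ENTRIES_PER_TABLE)
    second_level_table = top_level_table[top_index]   # first-level lookup
    frame_number = second_level_table[second_index]   # last-level lookup
    return frame_number * PAGE_SIZE + offset


# Hypothetical tables: entry 1 of the top level points to a table whose
# entry 3 maps to frame 42.
top_level_table = {1: {3: 42}}
virtual_address = (1 * ENTRIES_PER_TABLE + 3) * PAGE_SIZE + 0x10
physical_address = walk(virtual_address, top_level_table)
```

Each level adds one memory reference to a TLB miss, which is one reason the TLB matters even more with multilevel tables.&lt;br /&gt;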
&lt;br /&gt;
It may be noted that L1 caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; design that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch it from the next level of the memory hierarchy. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another). Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table). If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), that mapping must not be used and needs to be flushed from all TLBs. Further, protection-level changes for a mapping need to be observed by all processors. Finally, a mapping modification (where the physical page mapped to a virtual address changes) needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not maintain TLB coherence for every datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
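&lt;br /&gt;
The safe/unsafe distinction above can be expressed as a small predicate. This is a sketch under simplifying assumptions: the Translation record and its three fields are hypothetical, accessed/dirty bits are not modelled (updates to them are safe), and a translation that was invalid before the change is assumed not to be cached in any TLB.&lt;br /&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    """Hypothetical, simplified view of a page table entry."""
    frame: int
    writable: bool
    valid: bool


def needs_tlb_coherence(old, new):
    """Return True when changing old to new is an unsafe change."""
    if not old.valid:
        return False  # assumption: invalid translations are never cached
    if not new.valid:
        return True   # c) translation marked as invalid
    if old.frame != new.frame:
        return True   # a) mapping modification
    if old.writable and not new.writable:
        return True   # b) privileges decreased
    return False      # d) privilege increase: handled lazily on a fault


read_only = Translation(frame=5, writable=False, valid=True)
read_write = Translation(frame=5, writable=True, valid=True)
upgrade_is_unsafe = needs_tlb_coherence(read_only, read_write)
downgrade_is_unsafe = needs_tlb_coherence(read_write, read_only)
```

The upgrade from read-only to read-write is reported as safe, matching the lazy-update example in the text, while the downgrade is reported as unsafe.&lt;br /&gt;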
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. Its variants differ in several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt (IPI) request to its interrupt controller. The interrupt includes a data structure that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
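&lt;br /&gt;
The six steps above can be traced with a sequential toy model. This is a sketch only: it is single-threaded, so interrupts, the page-table lock and the wait states are represented by comments rather than real synchronization, and the class and function names are hypothetical.&lt;br /&gt;

```python
class Processor:
    """Toy processor holding a private TLB and a TLB-active flag."""

    def __init__(self, name):
        self.name = name
        self.tlb = {}           # page number -> frame number
        self.tlb_active = True

    def handle_shootdown(self, pages):
        # Step 4: invalidate the affected entries, clear the active flag,
        # then (in a real system) wait for the initiator to finish.
        for page in pages:
            self.tlb.pop(page, None)
        self.tlb_active = False


def shootdown(initiator, recipients, page_table, page, new_frame):
    # Step 1: the initiator disables interrupts (not modelled) and clears its flag.
    initiator.tlb_active = False
    # Step 2: the page table lock is implicit in this single-threaded model.
    # Step 3: one inter-processor interrupt per recipient, modelled as a call.
    for cpu in recipients:
        cpu.handle_shootdown([page])
    # Step 5: update the PMT and the initiator's own TLB, then set the flag.
    page_table[page] = new_frame
    initiator.tlb[page] = new_frame
    initiator.tlb_active = True
    # Step 6: recipients resume; here they refill lazily on their next TLB miss.
    for cpu in recipients:
        cpu.tlb_active = True


page_table = {7: 100}
cpu0, cpu1, cpu2 = Processor("cpu0"), Processor("cpu1"), Processor("cpu2")
for cpu in (cpu0, cpu1, cpu2):
    cpu.tlb[7] = 100  # every processor has cached the old mapping
shootdown(cpu0, [cpu1, cpu2], page_table, page=7, new_frame=200)
```

After the call, no processor holds the stale mapping of page 7 to frame 100: the initiator has the new mapping and the recipients will reload it from the page table on their next miss.&lt;br /&gt;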
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
''Hierarchical TLBs''&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one  or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’s ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so the process can no longer hit on those stale entries. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are managed in a similar way by Xen on AMD-V (SVM) hardware. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
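&lt;br /&gt;
The ASID bookkeeping described above can be sketched as follows. The class, its method names and the per-processor counters are hypothetical; in particular, the sketch ignores the fact that real ASIDs come from a small finite range and must eventually be recycled.&lt;br /&gt;

```python
import itertools


class AsidManager:
    """Toy model of the array that maps (process, processor) to an ASID."""

    def __init__(self, cpu_count):
        self.cpu_count = cpu_count
        self.asid_of = {}  # (pid, cpu) -> ASID; 0 means a fresh ASID is needed
        self.next_asid = [itertools.count(1) for _ in range(cpu_count)]

    def asid_for(self, pid, cpu):
        """Called when pid runs on cpu; hands out a new ASID if it has none."""
        if self.asid_of.get((pid, cpu), 0) == 0:
            self.asid_of[(pid, cpu)] = next(self.next_asid[cpu])
        return self.asid_of[(pid, cpu)]

    def pte_modified(self, pid, modifying_cpu):
        """Zero the ASIDs of pid on every other processor; TLBs are untouched."""
        for cpu in range(self.cpu_count):
            if cpu != modifying_cpu:
                self.asid_of[(pid, cpu)] = 0


manager = AsidManager(cpu_count=2)
old_asid = manager.asid_for(pid=42, cpu=1)
tlb_cpu1 = {(old_asid, 7): 100}         # entry tagged with the old ASID
manager.pte_modified(pid=42, modifying_cpu=0)
new_asid = manager.asid_for(pid=42, cpu=1)
stale_entry_hits = (new_asid, 7) in tlb_cpu1
```

Because look-ups on processor 1 now carry the new ASID, the entry tagged with the old ASID can no longer match, which is how the stale translation is avoided without an interrupt.&lt;br /&gt;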
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
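&lt;br /&gt;
The generation-count check can be illustrated with a short model. The names below are hypothetical and the check is done in software here purely for illustration; in the scheme described above the comparison happens when memory is accessed.&lt;br /&gt;

```python
class MemoryManager:
    """Toy model keeping one generation count per frame."""

    def __init__(self):
        self.generation = {}  # frame number -> current generation count

    def current_generation(self, frame):
        return self.generation.get(frame, 0)

    def unsafe_change(self, frame):
        """Remapping or restricting permissions bumps the generation count."""
        self.generation[frame] = self.current_generation(frame) + 1

    def access(self, frame, cached_generation):
        """Grant the request only if the TLB entry's generation is current."""
        return cached_generation == self.current_generation(frame)


manager = MemoryManager()
tlb_entry = {"frame": 100, "generation": manager.current_generation(100)}
granted_before = manager.access(tlb_entry["frame"], tlb_entry["generation"])
manager.unsafe_change(100)
granted_stale = manager.access(tlb_entry["frame"], tlb_entry["generation"])
tlb_entry["generation"] = manager.current_generation(100)  # processor refreshes its entry
granted_after = manager.access(tlb_entry["frame"], tlb_entry["generation"])
```

The stale request is denied without any interrupt being sent to the processor; the cost is paid only by a processor that actually touches the changed page.&lt;br /&gt;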
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section, the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu/Lebeck/Sorin/Bracy in their paper are discussed. This provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into an existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without needing any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach for TLB coherence and significantly reduces the performance penalty due to TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are all-hardware based because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. A hardware-based TLB coherence scheme can a) improve TLB coherence performance significantly for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity.  The performance penalty for shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the ones receiving it. The actual latency is higher than shown in the graphs because TLB invalidations cost extra cycles for repopulating the TLB, from either the cache or main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often. (Refer to graph-2).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence'''&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture: A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta ]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59823</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59823"/>
		<updated>2012-03-19T01:43:50Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* TLB Coherence Approaches */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive). &lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct, managed by the operating system, that is kept in main memory. For a process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
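&lt;br /&gt;
The arithmetic in the parenthetical note above can be checked directly. The sketch assumes, as the text does, that all 32 bits of an entry are available for the frame number; the variable names are chosen only for this example.&lt;br /&gt;

```python
ENTRY_BITS = 32        # a 4-byte page table entry
FRAME_SIZE = 4 * 1024  # 4 KB frames, i.e. 2^12 bytes

addressable_frames = 2 ** ENTRY_BITS                    # 2^32 frames
addressable_bytes = addressable_frames * FRAME_SIZE     # 2^32 * 2^12 = 2^44 bytes
addressable_terabytes = addressable_bytes // (2 ** 40)  # 1 TB = 2^40 bytes
```

In practice some bits of each entry are used for protection and status flags, so the real limit is lower than this upper bound.&lt;br /&gt;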
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit.  The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access through the PTBR), read the frame number from the page table entry, and only then access the desired location in that frame. Every access therefore needs two memory accesses, so memory access is slowed down by a factor of 2. &lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024. &lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory. &lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame number is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, one entry must be selected for replacement (by the operating system or, on hardware-managed TLBs, by the hardware). Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame is read-only or read-write (the '''protection level'''). Another bit, the '''valid-invalid''' bit, indicates whether the page is part of that particular process's address space.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that L1 caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; design that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch it from the next level of the memory hierarchy. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another). Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table). If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), that mapping must not be used and needs to be flushed from all TLBs. Further, protection-level changes for a mapping need to be observed by all processors. Finally, a mapping modification (where the physical page mapped to a virtual address changes) needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not maintain TLB coherence for every datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to coherence between PMTs and data caches than to TLB coherence, it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants differ in two main respects: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt (IPI) request to its interrupt controller. The interrupt includes a data structure that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
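&lt;br /&gt;
The six steps above can be traced with a single-threaded simulation. This is a rough sketch under stated assumptions rather than a kernel implementation: the Processor class, the shootdown function and the dictionary-based page table are hypothetical, interrupts are not modelled, and the inter-processor interrupt is reduced to a direct loop over the victims.&lt;br /&gt;
&lt;br /&gt;
```python
class Processor:
    """Hypothetical CPU model holding a private TLB."""

    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}            # virtual page number -> frame number
        self.tlb_active = True   # is this CPU currently using its TLB?
        self.waiting = False     # is this CPU stalled in the wait state?


def shootdown(initiator, victims, page_table, vpage, new_frame):
    """Remap vpage to new_frame, keeping all TLBs consistent."""
    # Step 1: the initiator disables interrupts (not modelled here)
    # and clears its TLB active flag.
    initiator.tlb_active = False
    # Step 2: lock the page table (a plain flag in this sketch).
    page_table["locked"] = True
    # Steps 3 and 4: deliver the IPI; each victim invalidates the
    # affected entry, clears its active flag and enters the wait state.
    for cpu in victims:
        cpu.tlb.pop(vpage, None)
        cpu.tlb_active = False
        cpu.waiting = True
    # Step 5: update the PMT and the local TLB, set the active flag,
    # then release the lock.
    page_table["entries"][vpage] = new_frame
    initiator.tlb[vpage] = new_frame
    initiator.tlb_active = True
    page_table["locked"] = False
    # Step 6: victims leave the wait state; in this sketch they reload
    # the new translation lazily on their next TLB miss.
    for cpu in victims:
        cpu.waiting = False
        cpu.tlb_active = True
```
&lt;br /&gt;
A real implementation runs the victims concurrently, which is why the wait state and the page-table lock are needed; the sequential loop only shows the ordering of events.&lt;br /&gt;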
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that it is not guaranteed to stay the same for the life of a process. Each process is assigned its own ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process's ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so those entries are effectively invalidated. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used by Xen on AMD64 processors with AMD-V. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
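&lt;br /&gt;
The ASID mechanism described above can be sketched as follows. The Cpu class, its method names and the pte_modified helper are hypothetical and chosen only for illustration; the sketch assumes an unbounded supply of ASIDs, whereas real hardware has a limited number and must recycle them.&lt;br /&gt;
&lt;br /&gt;
```python
class Cpu:
    """Hypothetical processor with an ASID-tagged TLB."""

    def __init__(self):
        self.asid_of = {}     # pid -> ASID on this CPU; 0 means "reassign"
        self.next_asid = 1    # ASID wrap-around is not modelled
        self.tlb = {}         # (asid, virtual page) -> frame

    def asid_for(self, pid):
        # A zero (or missing) ASID makes the kernel hand out a fresh one.
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = self.next_asid
            self.next_asid += 1
        return self.asid_of[pid]

    def fill(self, pid, vpage, frame):
        self.tlb[(self.asid_for(pid), vpage)] = frame

    def lookup(self, pid, vpage):
        # Returns None on a TLB miss.
        return self.tlb.get((self.asid_for(pid), vpage))


def pte_modified(pid, initiator, all_cpus):
    """Zero the ASID of pid on every other CPU; TLB entries are untouched."""
    for cpu in all_cpus:
        if cpu is not initiator:
            cpu.asid_of[pid] = 0
```
&lt;br /&gt;
After pte_modified runs, the stale entry is still physically present in the other processor's TLB, but it is tagged with the old ASID and therefore can no longer match a lookup made under the new one.&lt;br /&gt;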
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
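&lt;br /&gt;
A generation-count check of this kind can be sketched as follows. The MemoryManager and ValidatingTlb classes and the StaleTranslation exception are hypothetical names used only for illustration, and the sketch assumes that the generation count is compared at the memory side on every access.&lt;br /&gt;
&lt;br /&gt;
```python
class StaleTranslation(Exception):
    """Raised when a TLB entry carries an out-of-date generation count."""


class MemoryManager:
    def __init__(self):
        self.page_table = {}   # virtual page -> (frame, writable)
        self.generation = {}   # frame -> generation count

    def map(self, vpage, frame, writable):
        old = self.page_table.get(vpage)
        self.page_table[vpage] = (frame, writable)
        self.generation.setdefault(frame, 0)
        if old is None:
            return
        remapped = old[0] != frame
        restricted = old[1] and not writable
        if remapped or restricted:
            # Unsafe change: bump the count of the previously mapped frame.
            self.generation[old[0]] += 1

    def access(self, frame, tlb_generation):
        if tlb_generation != self.generation[frame]:
            raise StaleTranslation(frame)


class ValidatingTlb:
    def __init__(self, memory_manager):
        self.mm = memory_manager
        self.entries = {}      # virtual page -> (frame, generation)
        self.reloads = 0

    def reload(self, vpage):
        frame, _writable = self.mm.page_table[vpage]
        self.entries[vpage] = (frame, self.mm.generation[frame])
        self.reloads += 1

    def read(self, vpage):
        if vpage not in self.entries:
            self.reload(vpage)
        frame, generation = self.entries[vpage]
        try:
            self.mm.access(frame, generation)
        except StaleTranslation:
            # Request denied: refresh the entry and retry once.
            self.reload(vpage)
            frame, generation = self.entries[vpage]
            self.mm.access(frame, generation)
        return frame
```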
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
In this section, the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu/Lebeck/Sorin/Bracy in their paper are discussed. This provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into an existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without needing any change to the existing cache coherence protocol. This protocol is an improvement over the software-based shootdown approach for TLB coherence and significantly reduces the performance penalty due to TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. A hardware-based TLB coherence protocol can a) improve TLB coherence performance significantly for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the ones receiving it. The actual latency is higher than shown in the graphs because TLB invalidations result in extra cycles spent repopulating the TLB, either from the cache or from main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often (refer to graph-2).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence'''&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture - A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59821</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59821"/>
		<updated>2012-03-19T01:42:50Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* TLB Coherence Approaches */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory - &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space) that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For a process, a pointer to the page table ('''Page-Table Base Register''', PTBR) is stored in a register. Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the PTBR) to find the frame number, and only then access the desired location in that frame. Every reference therefore needs two memory accesses, so memory access is slowed down by a factor of 2.&lt;br /&gt;
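&lt;br /&gt;
The address split and the page-table arithmetic can be illustrated with a short sketch. It assumes 4 KB frames and a flat, dictionary-based page table; the names PAGE_SIZE, translate and ADDRESSABLE_BYTES are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
PAGE_SIZE = 4096   # 4 KB frames, i.e. 2**12 bytes (assumed for this sketch)


def translate(virtual_address, page_table):
    """Translate a virtual address using a flat page table (a dict here)."""
    # Split the address into page number (p) and page offset (d).
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    # Indexing the page table is the extra memory access per reference.
    frame_number = page_table[page_number]
    # Combine the frame base address with the offset.
    return frame_number * PAGE_SIZE + offset


# A 4-byte (32-bit) entry can name one of 2**32 frames, so the
# addressable physical memory is 2**32 frames of 2**12 bytes each.
ADDRESSABLE_BYTES = 2**32 * PAGE_SIZE
```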
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
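&lt;br /&gt;
The hit and miss handling described above can be sketched as a small software model with LRU replacement. The SimpleTlb class and its counters are hypothetical, the page table is modelled as a dictionary, and LRU is only one of the replacement policies mentioned.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class SimpleTlb:
    """Hypothetical software model of a small TLB with LRU replacement."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # page number -> frame number
        self.entries = OrderedDict()   # least recently used entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, page_number):
        if page_number in self.entries:
            # TLB hit: the frame number is available immediately.
            self.hits += 1
            self.entries.move_to_end(page_number)
            return self.entries[page_number]
        # TLB miss: a memory reference to the page table is needed.
        self.misses += 1
        frame_number = self.page_table[page_number]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)   # evict the LRU entry
        self.entries[page_number] = frame_number
        return frame_number
```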
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame is read-only or read-write ('''protection level'''). Another bit, '''valid-invalid''', indicates whether the page is part of the address space of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; design that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch the block. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A page's protection level changes, e.g., from read-only to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to become invalid (this is indicated by the valid/invalid bit in the page table). If a TLB holds a mapping that falls within an invalidated Page Table Entry (PTE), that mapping must be flushed from all TLBs and must not be used. Protection-level changes for a TLB mapping must likewise be observed by all processors. Finally, a mapping modification for a TLB entry (where the physical page mapped to a virtual address changes) must be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction: an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it is not discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants turn on two factors: how many processors participate in the TLB update (the victim list), and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
''Details of the TLB Shootdown''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt (IPI) request to its interrupt controller. The interrupt includes a data structure that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
&lt;br /&gt;
#  The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
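&lt;br /&gt;
The six steps above can be sketched as a small single-threaded simulation. This is only an illustrative model, not any kernel's implementation: the names Processor, shootdown, tlb_active and the dictionary-based page table are assumptions made for the example, and the inter-processor interrupt and the wait state are modelled as ordinary sequential code.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical, single-threaded model of the six shootdown steps.
class Processor:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}                    # virtual page to frame (cached copy)
        self.tlb_active = True           # the "active flag" from the steps
        self.interrupts_enabled = True


def shootdown(initiator, others, page_table, vpage, new_frame):
    """Update page_table[vpage] and invalidate stale TLB copies."""
    # Step 1: the initiator disables interrupts and clears its active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: lock the page table (modelled as a simple flag).
    page_table["locked"] = True
    # Steps 3 and 4: interrupt every other processor; each one invalidates
    # the affected entry, clears its active flag, and then waits.
    responded = []
    for cpu in others:
        cpu.interrupts_enabled = False
        cpu.tlb.pop(vpage, None)
        cpu.tlb_active = False
        responded.append(cpu.cpu_id)
    # Step 5: all recipients responded, so update the PMT and the local TLB,
    # set the active flag, and release the lock.
    page_table["entries"][vpage] = new_frame
    initiator.tlb[vpage] = new_frame
    initiator.tlb_active = True
    initiator.interrupts_enabled = True
    page_table["locked"] = False
    # Step 6: recipients leave the wait state and resume TLB use; in this
    # model they reload the new mapping on their next miss.
    for cpu in others:
        cpu.tlb_active = True
        cpu.interrupts_enabled = True
    return responded


cpus = [Processor(i) for i in range(3)]
table = {"locked": False, "entries": {7: 100}}
for cpu in cpus:
    cpu.tlb[7] = 100                     # every TLB caches the old mapping
acked = shootdown(cpus[0], cpus[1:], table, vpage=7, new_frame=250)
```
&lt;br /&gt;
After the call, processors 1 and 2 no longer hold an entry for page 7, while the initiator and the page table both hold the new frame. A real implementation would run the recipients concurrently and would need genuine locking.&lt;br /&gt;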
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following  platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process's ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns the process a new ASID. The new ASID does not match the ASIDs of the stale TLB entries on that processor, so those entries can no longer be used, which has the effect of invalidating them. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used for TLB management on AMD64 (AMD-V) platforms. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
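&lt;br /&gt;
The ASID bookkeeping described above can be sketched as follows. This is a minimal model under stated assumptions: the class AsidManager, its method names, and the per-processor counter used to hand out fresh ASIDs are hypothetical, and ASID wrap-around (which a real kernel must handle) is ignored.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical model: asid_of[(pid, cpu)] plays the role of the array
# that maps a process to its ASID on each processor.
class AsidManager:
    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        self.asid_of = {}                    # (pid, cpu) to ASID; 0 means stale
        self.next_asid = [1] * num_cpus      # next unused ASID per processor

    def current_asid(self, pid, cpu):
        """Return the ASID for pid on cpu, assigning a fresh one if it is 0."""
        if self.asid_of.get((pid, cpu), 0) == 0:
            self.asid_of[(pid, cpu)] = self.next_asid[cpu]
            self.next_asid[cpu] += 1
        return self.asid_of[(pid, cpu)]

    def pte_modified(self, pid, modifying_cpu):
        """Zero the ASID of pid on every other processor; TLBs are untouched."""
        for cpu in range(self.num_cpus):
            if cpu != modifying_cpu:
                self.asid_of[(pid, cpu)] = 0


mgr = AsidManager(num_cpus=2)
old = mgr.current_asid(pid=42, cpu=1)        # TLB entries on cpu 1 carry this tag
mgr.pte_modified(pid=42, modifying_cpu=0)
new = mgr.current_asid(pid=42, cpu=1)        # fresh tag; old entries never match
```
&lt;br /&gt;
Because the stale TLB entries on processor 1 still carry the old tag, lookups made under the new tag miss and reload the translation from the page table.&lt;br /&gt;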
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
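&lt;br /&gt;
A minimal sketch of the validation idea follows. The classes Memory and Tlb and their methods are hypothetical, and the choice to bump the generation count of both the old and the new frame on a remapping is an assumption of this example rather than a detail taken from the cited paper.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical model of generation-count validation.
class Memory:
    def __init__(self):
        self.frame_of = {}       # virtual page to frame (the page table)
        self.generation = {}     # frame to generation count

    def map_page(self, vpage, frame):
        """Install or change a mapping, bumping each affected frame's count."""
        old = self.frame_of.get(vpage)
        for f in (old, frame):
            if f is not None:
                self.generation[f] = self.generation.get(f, 0) + 1
        self.frame_of[vpage] = frame

    def access(self, frame, generation):
        """Accept the request only if the caller's generation is current."""
        return self.generation.get(frame) == generation


class Tlb:
    def __init__(self, memory):
        self.memory = memory
        self.entries = {}        # virtual page to (frame, generation)

    def read(self, vpage):
        """Return (frame, reloads); reloads counts rejected stale requests."""
        reloads = 0
        while True:
            if vpage not in self.entries:
                frame = self.memory.frame_of[vpage]
                self.entries[vpage] = (frame, self.memory.generation[frame])
            frame, gen = self.entries[vpage]
            if self.memory.access(frame, gen):
                return frame, reloads
            del self.entries[vpage]          # stale: drop the entry and reload
            reloads += 1


mem = Memory()
mem.map_page(3, frame=10)
tlb = Tlb(mem)
first = tlb.read(3)              # entry is current, no reload needed
mem.map_page(3, frame=11)        # remap without notifying the TLB
second = tlb.read(3)             # stale entry is rejected once, then reloaded
```
&lt;br /&gt;
No interrupt is sent when the mapping changes; the stale entry is only detected, and refreshed, when the processor next uses it.&lt;br /&gt;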
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin, and Bracy. It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. UNITD improves on the software-based shootdown approach to TLB coherence and significantly reduces the performance penalty of maintaining TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. A hardware-based TLB coherence protocol can a) improve TLB coherence performance significantly for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the processors receiving it. The actual latency is higher than shown in the graphs because TLB invalidations cost extra cycles to repopulate the TLB, either from the cache or from main memory depending on the case. The latency from TLB shootdowns is also higher for applications that invoke the shootdown routine more often (refer to graph-2).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence'''&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems,Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture - A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta ]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59819</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59819"/>
		<updated>2012-03-19T01:42:26Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* TLB Coherence Approaches */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space), that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB (2^12 bytes), then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
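&lt;br /&gt;
The 16 TB figure can be checked with a few lines of arithmetic. The constant names are made up for this example, and the calculation assumes that all 32 bits of an entry hold a frame number (real entries usually reserve some bits for flags).&lt;br /&gt;
&lt;br /&gt;
```python
ENTRY_BITS = 32                  # a 4-byte page-table entry
FRAME_SIZE = 4 * 1024            # 4 KB frames, that is 2**12 bytes

frames = 2 ** ENTRY_BITS         # distinct frame numbers an entry can hold
addressable_bytes = frames * FRAME_SIZE
addressable_tb = addressable_bytes // 2 ** 40
```
&lt;br /&gt;
Here addressable_bytes equals 2**44 and addressable_tb equals 16, counting 1 TB as 2**40 bytes.&lt;br /&gt;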
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access at the address given by the PTBR offset by the page number), obtain the frame number from the page-table entry, and only then access the desired location in that frame. Two memory accesses are therefore needed for each reference, so memory access is slowed down by a factor of 2.&lt;br /&gt;
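&lt;br /&gt;
As an illustration of the address split, the sketch below translates a virtual address through a page table. The 4 KB page size, the function name translate, and the sample mapping are assumptions chosen for the example; the page table here stores frame numbers, so the frame's base address is the frame number times the page size.&lt;br /&gt;
&lt;br /&gt;
```python
PAGE_SIZE = 4096                          # assumed 4 KB pages

def translate(vaddr, page_table):
    """Split vaddr into (p, d) and return the physical address."""
    p, d = divmod(vaddr, PAGE_SIZE)       # page number and page offset
    frame = page_table[p]                 # first memory access: the page table
    return frame * PAGE_SIZE + d          # second access would use this address

page_table = {0: 5, 1: 9}                 # hypothetical page to frame mapping
paddr = translate(4100, page_table)       # page 1, offset 4
```
&lt;br /&gt;
Virtual address 4100 falls in page 1 at offset 4, so with page 1 mapped to frame 9 the physical address is 9 * 4096 + 4 = 36868.&lt;br /&gt;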
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
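&lt;br /&gt;
The hit and miss paths described above can be sketched with a small software model. The class Tlb, its capacity parameter, and the choice of LRU as the replacement policy are assumptions for illustration; a real TLB performs the lookup in hardware across all entries at once.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class Tlb:
    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table      # page number to frame number
        self.entries = OrderedDict()      # least recently used entry first
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        if page in self.entries:          # TLB hit: frame available at once
            self.hits += 1
            self.entries.move_to_end(page)
            return self.entries[page]
        self.misses += 1                  # TLB miss: consult the page table
        frame = self.page_table[page]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)   # evict the LRU entry
        self.entries[page] = frame        # cache it for the next reference
        return frame


tlb = Tlb(capacity=2, page_table={0: 5, 1: 9, 2: 3})
frames = [tlb.lookup(p) for p in (0, 1, 0, 2, 1)]
```
&lt;br /&gt;
With a two-entry TLB, the reference string 0, 1, 0, 2, 1 produces one hit (the second reference to page 0) and four misses, because page 1 is evicted to make room for page 2 before it is referenced again.&lt;br /&gt;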
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of the virtual page to the physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that enables simultaneous look-up of both the L-1 cache and the TLB. If the block is not found in the L-1 cache, the address obtained from the TLB look-up is used to fetch it. Simultaneous look-up of both the L-1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes, and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to become invalid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that falls within an invalidated Page Table Entry (PTE), the mapping must be flushed from all TLBs and must not be used. Further, protection-level changes for a TLB mapping must be observed by all processors. Likewise, a mapping modification for a TLB entry (where the physical mapping changes for a virtual address) must be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for every datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only), and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges and e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review  UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB and cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it is not discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants turn on two factors: how many processors participate in the TLB update (the victim list), and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
'''Details of the TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt (IPI) request to its interrupt controller. The interrupt includes a data structure that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
&lt;br /&gt;
#  The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
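&lt;br /&gt;
The six steps above can be sketched as a small single-threaded Python simulation. This is an illustrative model only, not the code of any particular kernel: the names Processor, handle_shootdown_ipi and shootdown are hypothetical, the TLB and page table are plain dictionaries, and the inter-processor interrupt and page-table lock are modeled with a method call and a boolean rather than real hardware mechanisms.&lt;br /&gt;
&lt;br /&gt;
```python
class Processor:
    """Hypothetical model of one CPU with a private TLB (virtual page to frame)."""

    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}
        self.tlb_active = True          # flag: is this CPU currently using its TLB?
        self.interrupts_enabled = True

    def handle_shootdown_ipi(self, pages):
        # Step 4: the victim masks interrupts, drops the stale entries,
        # clears its active flag and (conceptually) waits on the lock.
        self.interrupts_enabled = False
        for page in pages:
            self.tlb.pop(page, None)
        self.tlb_active = False

    def resume(self):
        # Step 6: the victim leaves its wait state and uses its TLB again.
        self.tlb_active = True
        self.interrupts_enabled = True


def shootdown(initiator, victims, page_table, updates):
    """Apply updates (page to new frame, or None to unmap) to the page table."""
    log = []
    # Step 1: disable local interrupts and clear the active flag.
    initiator.interrupts_enabled = False
    initiator.tlb_active = False
    # Step 2: lock the page table (only recorded in the log in this sketch).
    log.append("locked")
    # Step 3: send the IPI naming the target entries, then wait for responses.
    for cpu in victims:
        cpu.handle_shootdown_ipi(list(updates))
    assert not any(cpu.tlb_active for cpu in victims)
    log.append("victims quiesced")
    # Step 5: change the page table and drop the initiator's own stale entries.
    for page, frame in updates.items():
        initiator.tlb.pop(page, None)
        if frame is None:
            page_table.pop(page, None)
        else:
            page_table[page] = frame
    initiator.tlb_active = True         # set before the lock is released
    initiator.interrupts_enabled = True
    log.append("unlocked")
    # Step 6: victims resume; they reload entries on later TLB misses.
    for cpu in victims:
        cpu.resume()
    return log
```

In this sketch the victims simply drop the affected entries and refill them on a later miss; an implementation that flushes the entire TLB would clear the dictionary instead.&lt;br /&gt;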
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction-based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an ASID. The ASID is similar to a process identifier (PID) except that its uniqueness is not guaranteed for the life of a process. Each process is assigned a unique ASID on each processor. Hence, a PID may have many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all the process’ ASIDs to 0 on the other processors. Note: This does not modify the ASIDs in the TLB entries, only the array that maps a process to a processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process is assigned a zero ASID value on a processor, it assigns it a new ASID. This new ASID will not match the stale TLB entries (and their ASIDs) on that processor, so the stale entries are effectively invalidated and the next access reloads the translation from the page table. The IRIX 5.2 operating system uses this mechanism on MIPS; ASIDs are also used on AMD64 processors, for example by the Xen hypervisor under AMD-V. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
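&lt;br /&gt;
A minimal Python sketch of the ASID scheme described above follows. It is an assumption-laden model rather than IRIX or MIPS code: the class Cpu, its asid_of array and the helper modify_pte are hypothetical names, and ASID exhaustion (wrap-around of the counter) is not handled.&lt;br /&gt;
&lt;br /&gt;
```python
class Cpu:
    """Hypothetical CPU whose TLB entries are tagged with an ASID."""

    def __init__(self):
        self.tlb = {}          # (asid, virtual page) to frame
        self.asid_of = {}      # pid to ASID on this CPU; 0 means "needs a new one"
        self.next_asid = 1

    def asid_for(self, pid):
        # Assign a fresh ASID if the process has none or was reset to 0.
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = self.next_asid
            self.next_asid += 1
        return self.asid_of[pid]

    def translate(self, pid, vpage, page_table):
        key = (self.asid_for(pid), vpage)
        if key not in self.tlb:            # TLB miss: reload from the page table
            self.tlb[key] = page_table[vpage]
        return self.tlb[key]


def modify_pte(cpus, initiator, pid, page_table, vpage, frame):
    """Change a mapping, then zero the process's ASID on every other CPU."""
    page_table[vpage] = frame
    # The initiating CPU drops its own cached copy directly.
    initiator.tlb.pop((initiator.asid_for(pid), vpage), None)
    for cpu in cpus:
        if cpu is not initiator:
            cpu.asid_of[pid] = 0   # only the mapping array changes, not the TLB
```

Note that the stale entry stays physically present in the other processor's TLB; it simply can no longer match once the process runs under its new ASID.&lt;br /&gt;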
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
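&lt;br /&gt;
The generation-count check can be illustrated with the following Python sketch. The classes Memory and Cpu and their methods are hypothetical, and the model covers only remapping of a frame (not permission changes); it is meant to show the validate-on-access idea, not a specific hardware design.&lt;br /&gt;
&lt;br /&gt;
```python
class Memory:
    """Hypothetical memory manager that keeps a generation count per frame."""

    def __init__(self):
        self.page_table = {}    # virtual page to frame
        self.generation = {}    # frame to generation count

    def map_page(self, vpage, frame):
        # Remapping a frame bumps its generation, so older TLB entries go stale.
        self.page_table[vpage] = frame
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def access(self, frame, generation):
        # The request is honoured only if the caller's generation is current.
        return self.generation.get(frame) == generation


class Cpu:
    def __init__(self, memory):
        self.memory = memory
        self.tlb = {}           # virtual page to (frame, generation)
        self.refills = 0

    def load(self, vpage):
        for _ in range(2):      # at most one refill is needed per access
            if vpage not in self.tlb:
                frame = self.memory.page_table[vpage]
                self.tlb[vpage] = (frame, self.memory.generation[frame])
                self.refills += 1
            frame, generation = self.tlb[vpage]
            if self.memory.access(frame, generation):
                return frame
            del self.tlb[vpage]  # stale entry: drop it and retry
        raise RuntimeError("mapping changed during access")
```

No interrupt is sent when a mapping changes; the stale entry is detected only when the processor next presents it to memory.&lt;br /&gt;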
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin, and Bracy. It provides an example of a hardware-based unified protocol for cache-TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. It improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. A hardware-based TLB coherence protocol can a) improve TLB coherence performance significantly for a high number of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly impact user performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty for shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency of the processor issuing the shootdown and of the processors receiving it. The actual latency is higher than shown in the graphs because TLB invalidations cause extra cycles to be spent repopulating the TLB, from either the cache or main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often. (Refer to graph-2.)&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence'''&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems,Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture - A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59808</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59808"/>
		<updated>2012-03-19T01:35:58Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, called &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space), that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking the virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For a process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, substantially reducing the context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.)&lt;br /&gt;
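&lt;br /&gt;
The arithmetic in the parenthetical above can be checked with a few lines of Python. The constant names are illustrative, and the sketch assumes that all 32 bits of an entry are available as a frame number (real entries reserve some bits for flags).&lt;br /&gt;
&lt;br /&gt;
```python
ENTRY_BITS = 4 * 8            # a 4-byte page-table entry holds 32 bits
FRAME_SIZE = 4 * 1024         # 4 KB frames

frames = 2 ** ENTRY_BITS      # number of distinct frame numbers
addressable = frames * FRAME_SIZE

print(addressable == 2 ** 44)           # True
print(addressable // 2 ** 40, "TB")     # 16 TB
```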
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the PTBR), find the frame number in the page-table entry, and only then access the required location in that frame. Memory access is thus slowed down by a factor of 2.&lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
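&lt;br /&gt;
The hit and miss paths described above can be sketched in Python as follows. The class Tlb and the function physical_address are hypothetical names, the page table is a plain dictionary, and LRU is chosen only as one of the replacement policies mentioned; a hardware TLB performs the key comparison in parallel rather than through a dictionary lookup.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class Tlb:
    """Hypothetical fully associative TLB with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # page number to frame number
        self.hits = 0
        self.misses = 0

    def translate(self, page, page_table):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)        # mark as most recently used
            return self.entries[page]
        self.misses += 1
        frame = page_table[page]                  # extra memory reference
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)      # evict least recently used
        self.entries[page] = frame
        return frame


def physical_address(tlb, page_table, address, page_size=4096):
    # Split the address into page number (p) and page offset (d).
    page, offset = divmod(address, page_size)
    return tlb.translate(page, page_table) * page_size + offset
```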
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; solution that enables simultaneous look-up of both the L1 cache and the TLB. If the page block is not found in the L1 cache, a TLB look-up of the address is used to refresh the page block. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be simultaneously cached in multiple TLBs. (This happens because of sharing of data across processors and process migration from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-physical page mapping to be no longer valid. (This is indicated by the valid/invalid bit in the page table.) If a TLB has a mapping that falls within an invalidated Page Table Entry (PTE), then the mapping needs to be flushed from all TLBs and should not be used. Further, protection level changes for a TLB mapping need to be followed by all processors. A mapping modification for a TLB entry (where the physical mapping changes for a virtual address) also needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not maintain TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) or c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered to be safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in the TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB coherence into the cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
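&lt;br /&gt;
The safe/unsafe distinction discussed above can be summarized as a small Python predicate. This is a simplified sketch: the function name is_safe_change and the dictionary keys frame, writable and present are assumptions made for illustration, and reference-history or dirty-bit updates (which are safe) are not modeled.&lt;br /&gt;
&lt;br /&gt;
```python
def is_safe_change(old, new):
    """Return True if TLBs may keep a cached copy of old after the update.

    Entries are dicts with hypothetical keys: frame, writable, present.
    """
    if old["present"] and not new["present"]:
        return False                      # page removed from main memory
    if old["present"] and old["frame"] != new["frame"]:
        return False                      # page-to-frame mapping changed
    if old["writable"] and not new["writable"]:
        return False                      # protection increased
    return True                           # e.g. read-only to read/write
```

A change that returns False here is one that would require a coherence action such as a shootdown before the old translation can be used safely.&lt;br /&gt;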
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence (versus TLB coherence), it will not be discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB Shootdown method is the most widely used TLB coherence approach. The approach has variants that turn on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine the visibility of the TLB to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins with the initiating processor locking the page table (or some portion thereof).&lt;br /&gt;
&lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt (IPI) request to its interrupt controller. The interrupt includes a data structure that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
&lt;br /&gt;
#  The recipients of the interrupt respond by disabling their local inter-processor interrupts and invalidating the affected TLB entries or the entire TLB (depending on the implementation) using the shootdown program. The receiving processor then clears its TLB active flag and enters into a wait state after servicing the interrupt. The wait state may be implemented by attempting to gain the same lock held by the initiating processor on the page table or polling shared memory.&lt;br /&gt;
&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
&lt;br /&gt;
# The receiving processors exit their wait state and update their TLB cache and set their TLB active flag.&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows Operating Systems running on Intel and AMD chips. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures &amp;lt;ref&amp;gt;[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction-based TLB invalidation is the Linux kernel.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an address space identifier (ASID). The ASID is similar to a process identifier (PID) except that it is not guaranteed to stay the same for the life of a process. Each process is assigned its own ASID on each processor, so a single PID may correspond to many ASIDs. A separate array is used to keep track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’s ASIDs to 0 on the other processors. Note that this does not modify the ASIDs in the TLB entries, only the array that tracks the process's ASID on each processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process has a zero ASID on a processor, it assigns the process a new ASID. The new ASID does not match the ASIDs of the stale TLB entries on that processor, so those entries are never hit and the translations are reloaded from the page table. The IRIX 5.2 operating system uses this mechanism on MIPS, and a comparable ASID-based scheme is described for Xen on AMD64 (AMD-V) hardware. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
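&lt;br /&gt;
A minimal sketch of the ASID bookkeeping described above follows. All names (Cpu, asid_for, modify_pte, translate) are hypothetical, and the sketch assumes that the processor making the change simply flushes its own TLB; the source text only describes what happens on the other processors.&lt;br /&gt;
&lt;br /&gt;
```python
import itertools


class Cpu:
    """Hypothetical CPU whose TLB entries are tagged with an ASID."""

    def __init__(self):
        self.tlb = {}                        # maps (asid, vpage) to frame
        self.next_asid = itertools.count(1)  # 0 is reserved as "invalid"


def asid_for(asid_table, cpus, pid, cpu_id):
    # Assign a fresh ASID when the recorded one is 0 (or missing).
    if asid_table[pid].get(cpu_id, 0) == 0:
        asid_table[pid][cpu_id] = next(cpus[cpu_id].next_asid)
    return asid_table[pid][cpu_id]


def modify_pte(asid_table, page_table, pid, vpage, frame, local_cpu):
    # Update the PTE, then zero the process's ASIDs on all other CPUs.
    # Only the kernel array changes; the TLB entries are left alone.
    page_table[(pid, vpage)] = frame
    for cpu_id in asid_table[pid]:
        if cpu_id != local_cpu:
            asid_table[pid][cpu_id] = 0


def translate(asid_table, cpus, page_table, pid, cpu_id, vpage):
    asid = asid_for(asid_table, cpus, pid, cpu_id)
    tlb = cpus[cpu_id].tlb
    if (asid, vpage) not in tlb:             # TLB miss: walk the page table
        tlb[(asid, vpage)] = page_table[(pid, vpage)]
    return tlb[(asid, vpage)]


cpus = {0: Cpu(), 1: Cpu()}
asid_table = {42: {}}
page_table = {(42, 7): 100}

before = [translate(asid_table, cpus, page_table, 42, c, 7) for c in (0, 1)]
modify_pte(asid_table, page_table, 42, 7, 205, local_cpu=0)
cpus[0].tlb.clear()  # assumption: the local CPU flushes its own TLB
after = [translate(asid_table, cpus, page_table, 42, c, 7) for c in (0, 1)]
print(before, after)  # [100, 100] [205, 205]
```
&lt;br /&gt;
The stale entry tagged with the old ASID is still present in the second processor's TLB after the update, but it can no longer be matched because the process now runs there under a different ASID.&lt;br /&gt;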
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count associated with the page when the TLB entry was created. Should a processor access memory through a TLB entry with a stale generation count, the request is denied and the processor is notified to update its TLB entry. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
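&lt;br /&gt;
The generation-count check can be illustrated with the following sketch. The classes Memory and Cpu and the exception StaleTranslation are hypothetical names, and the retry-once policy after a denied request is an assumption made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
class StaleTranslation(Exception):
    """Raised when a TLB entry carries an outdated generation count."""


class Memory:
    def __init__(self):
        self.page_table = {}   # virtual page to frame
        self.generation = {}   # frame to current generation count

    def map_page(self, vpage, frame):
        # Any remapping bumps the generation count of the frame involved.
        self.page_table[vpage] = frame
        self.generation[frame] = self.generation.get(frame, 0) + 1

    def access(self, frame, generation):
        # Validation happens at access time, not at update time.
        if self.generation[frame] != generation:
            raise StaleTranslation(frame)
        return "ok"


class Cpu:
    def __init__(self, memory):
        self.memory = memory
        self.tlb = {}  # virtual page to (frame, generation at fill time)

    def fill(self, vpage):
        frame = self.memory.page_table[vpage]
        self.tlb[vpage] = (frame, self.memory.generation[frame])

    def load(self, vpage):
        if vpage not in self.tlb:
            self.fill(vpage)
        try:
            return self.memory.access(*self.tlb[vpage])
        except StaleTranslation:
            # The request was denied: refresh the entry and retry once.
            self.fill(vpage)
            return self.memory.access(*self.tlb[vpage])


memory = Memory()
memory.map_page(vpage=7, frame=100)
cpu = Cpu(memory)
first = cpu.load(7)
cached_before = cpu.tlb[7]
memory.map_page(vpage=9, frame=100)   # frame 100 is reassigned to page 9
memory.map_page(vpage=7, frame=205)   # page 7 now lives in frame 205
second = cpu.load(7)
cached_after = cpu.tlb[7]
print(cached_before, cached_after)    # (100, 1) (205, 1)
```
&lt;br /&gt;
No interrupt is sent when the mapping changes; the stale entry is only detected, and repaired, when the processor next tries to use it.&lt;br /&gt;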
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) Coherence protocol proposed by Romanescu, Lebeck, Sorin, and Bracy in their paper. It provides an example of a hardware-based unified protocol for cache and TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. In this protocol, the TLBs participate in cache coherence updates without requiring any change to the existing cache coherence protocol. The protocol improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. On the other hand, the shootdown protocol used for TLB coherence is essentially software based. A hardware-based TLB coherence protocol can a) significantly improve TLB coherence performance for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly affect application performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used, and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty of a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The actual latency is higher than the graphs show, because TLB invalidations cause extra cycles to be spent repopulating the TLB, either from the cache or from main memory depending on the case. The latency from TLB shootdowns can be higher for applications that use the routine more often (refer to graph-2).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence'''&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems, Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture - A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59797</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59797"/>
		<updated>2012-03-19T01:31:30Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, called &amp;quot;'''Virtual Memory'''&amp;quot; (a single address space), that is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded from the backing storage (for example, a hard drive) into any available memory frames.&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct managed by the operating system and kept in main memory. For each process, a pointer to the page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, which substantially reduces context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, a system with 4-byte entries can address 2^44 bytes, or 16 TB, of physical memory.)&lt;br /&gt;
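&lt;br /&gt;
The arithmetic in the parenthetical note above can be checked with a few lines of Python. The sketch assumes, for simplicity, that all 32 bits of an entry are used to select a frame; real page-table entries reserve some bits for flags.&lt;br /&gt;
&lt;br /&gt;
```python
ENTRY_BITS = 32             # a 4-byte page-table entry
FRAME_SIZE = 4 * 1024       # 4 KB frames, i.e. 2**12 bytes

addressable_frames = 2 ** ENTRY_BITS
addressable_bytes = addressable_frames * FRAME_SIZE

print(addressable_bytes == 2 ** 44)          # True
print(addressable_bytes // 2 ** 40, "TB")    # 16 TB
```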
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts: a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the PTBR) to find the frame number, and only then access the desired location within that frame. Memory access is thus slowed down by a factor of 2.&lt;br /&gt;
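&lt;br /&gt;
The translation just described can be sketched as follows. The page size and the contents of page_table are made-up values for illustration, and the table is modelled as a plain Python list rather than a structure in main memory.&lt;br /&gt;
&lt;br /&gt;
```python
PAGE_SIZE = 4096  # assumed page size in bytes

# Hypothetical page table: index = page number, value = frame number.
page_table = [5, 9, 2, 7]


def translate(virtual_address):
    # Split the address into page number p and page offset d.
    p, d = divmod(virtual_address, PAGE_SIZE)
    frame = page_table[p]            # first memory access: the page table
    return frame * PAGE_SIZE + d     # second memory access uses this address


physical = translate(1 * PAGE_SIZE + 123)
print(physical)  # 9 * 4096 + 123 = 36987
```
&lt;br /&gt;
The two comments marking memory accesses correspond to the factor-of-2 slowdown mentioned above: one access to read the page-table entry and one to reach the data itself.&lt;br /&gt;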
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts: a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.&lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory.&lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
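&lt;br /&gt;
The hit and miss paths, together with LRU replacement, can be modelled in software as shown below. This is only a sketch: the class Tlb is hypothetical, and a real TLB performs the lookup in parallel hardware rather than in an ordered dictionary.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class Tlb:
    """Hypothetical software model of a small fully associative TLB."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # page number to frame number
        self.entries = OrderedDict()   # kept in least-recently-used order
        self.hits = 0
        self.misses = 0

    def lookup(self, page):
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)    # mark as most recently used
            return self.entries[page]
        # TLB miss: consult the page table, then cache the translation.
        self.misses += 1
        frame = self.page_table[page]
        if len(self.entries) == self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry
        self.entries[page] = frame
        return frame


tlb = Tlb(capacity=2, page_table={0: 5, 1: 9, 2: 2})
frames = [tlb.lookup(p) for p in (0, 1, 0, 2, 1)]
print(frames, tlb.hits, tlb.misses)  # [5, 9, 5, 2, 9] 1 4
```
&lt;br /&gt;
With only two entries, the reference to page 2 evicts page 1 (the least recently used), so the final reference to page 1 misses again.&lt;br /&gt;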
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the frame number is read only or both read-write ('''protection level'''). Another bit '''valid-invalid''' is also added to indicate whether the frame is part of that particular process.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
It may be noted that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that enables simultaneous look-up of both the L1 cache and the TLB. If the block is not found in the L1 cache, the physical address obtained from the TLB look-up is used to fetch the block. Simultaneous look-up of both the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be cached simultaneously in multiple TLBs. (This happens because data is shared across processors and because processes migrate from one processor to another.) Since the contents of these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is sometimes referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes, and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to become invalid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), that mapping must be flushed from all TLBs and must not be used. Further, protection-level changes to a mapping need to be observed by all processors. Finally, a mapping modification (where the physical page mapped to a virtual address changes) also needs to be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not maintain TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only), and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges and e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be updated lazily: when the core attempts to execute a store instruction, an access violation occurs and the page fault handler loads the updated translation.&lt;br /&gt;
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers, and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB and cache coherence protocols.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
TLB coherence refers to the consistency of the information stored in page-map tables (PMTs) with the cached copies of this data in each processor’s TLB. A PMT entry (and its TLB counterpart) stores various data elements, including:&lt;br /&gt;
&lt;br /&gt;
* The physical memory frame number for each virtual page resident in main memory.&lt;br /&gt;
* A protection field indicating how the page may be addressed (e.g., read-only or read-write).&lt;br /&gt;
* A presence bit that indicates whether the page is in main memory.&lt;br /&gt;
* A dirty bit that indicates whether the page was modified while in main memory.&lt;br /&gt;
* Page reference history used to indicate whether a page has been referenced during some time interval.&lt;br /&gt;
&lt;br /&gt;
Consistency among TLBs and PMTs need not be maintained for all types of changes to the above data elements. For example, it is possible to safely modify page reference history in main memory without also updating the TLBs. Updates are therefore categorized as either safe or unsafe. Safe changes include:&lt;br /&gt;
&lt;br /&gt;
* A reduction in page protection (moving access rights to a page from read-only to read/write).&lt;br /&gt;
* Setting the presence bit when a page becomes resident in main memory.&lt;br /&gt;
&lt;br /&gt;
It is important to note that these inconsistencies are not ignored but rather are handled downstream by other mechanisms. For example, if the operating system kernel detects a process is attempting to write to a page marked as read-only, it will verify whether the page’s protection settings have been reduced before throwing a protection fault. If the protection settings in main memory have been reduced by another processor, the kernel would invalidate the acting processor’s TLB and retry the operation.&lt;br /&gt;
&lt;br /&gt;
The above example also illustrates why a change to the same data element in the opposite direction (an increase in protection) would be unsafe without a TLB coherence scheme as there would be no mechanism for the operating system to trap the condition and update the processor’s TLB.&lt;br /&gt;
&lt;br /&gt;
Any modifications to the mapping between a virtual page and main memory frame are also unsafe without a TLB coherence mechanism. This is in large part due to the fact that page numbers are formed by splitting a finite address space into discrete units. These pages in turn are mapped to physical memory frames as needed by the operating system. The mapping of a page to a frame is a transient condition and the same page may be mapped to different frames across time. Without a TLB coherence mechanism, it is possible that a processor may attempt to use a stale page-frame mapping to reference a frame that has been reassigned.&lt;br /&gt;
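&lt;br /&gt;
The distinction between safe and unsafe changes can be expressed as a small predicate. The Pte class and the is_unsafe function are hypothetical and cover only the fields discussed above; treating the clearing of the presence bit as unsafe follows the earlier remark that marking a translation as invalid requires coherence.&lt;br /&gt;
&lt;br /&gt;
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pte:
    """Hypothetical page-map table entry with the fields discussed above."""
    frame: int
    writable: bool
    present: bool
    referenced: bool = False


def is_unsafe(old, new):
    # Remapping a resident page to a different frame is unsafe.
    if old.present and new.frame != old.frame:
        return True
    # Increasing protection (read-write to read-only) is unsafe.
    if old.writable and not new.writable:
        return True
    # Clearing the presence bit is unsafe.
    if old.present and not new.present:
        return True
    # Everything else (relaxing protection, setting the presence bit,
    # updating reference history) can be fixed up lazily.
    return False


read_only = Pte(frame=100, writable=False, present=True)
read_write = Pte(frame=100, writable=True, present=True)

print(is_unsafe(read_only, read_write))   # False: protection is relaxed
print(is_unsafe(read_write, read_only))   # True: protection is increased
print(is_unsafe(read_write, Pte(205, True, True)))  # True: frame changed
```
&lt;br /&gt;
Only the changes for which the predicate returns True need to trigger one of the coherence mechanisms described in the following subsections.&lt;br /&gt;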
&lt;br /&gt;
''TLB Coherence Strategies''&lt;br /&gt;
&lt;br /&gt;
There are several approaches to maintaining coherence or consistency among multiple TLBs in a multi-processor machine and the page tables in main memory.&lt;br /&gt;
&lt;br /&gt;
'''Virtually Indexed Caches'''&lt;br /&gt;
&lt;br /&gt;
This approach eliminates the TLB altogether and its associated coherence issue. This approach requires processors to index data cache values using virtual addresses and thereby avoid the need for virtual-physical address translations between the processor and the L1 cache. Ultimately, however, in order to address physical main memory on an L1 miss (or L2 miss depending on the indexing strategy of the L2 cache), translation is still required from the virtual reference to a physical address in memory. In a non-TLB architecture, this translation uses the page mapping tables directly.&lt;br /&gt;
&lt;br /&gt;
Under this approach, PMT entries may be brought into the processor’s data cache. Though the cache coherence protocol in effect will manage consistency across all processor caches, it does not handle consistency between cached PMT entries and the page tables in main memory. Hence, even in the absence of a TLB there remains a coherence problem between data caches and main memory page tables.  As PMT entries in main memory are modified by the operating system kernel, the processor data caches must be reconciled by invalidating stale PMT entries.&lt;br /&gt;
&lt;br /&gt;
Since this coherence issue relates more to PMT-data cache coherence than to TLB coherence, it is not discussed further in this section.&lt;br /&gt;
&lt;br /&gt;
'''TLB Shootdown'''&lt;br /&gt;
&lt;br /&gt;
The TLB shootdown method is the most widely used TLB coherence approach. Its variants depend on several factors: how many processors participate in the TLB update (the victim list) and the extent of hardware support for the process, which in turn may determine how visible the TLB is to the operating system kernel.&lt;br /&gt;
&lt;br /&gt;
In the discussion below, the initiating processor is defined as the processor servicing the operating system kernel during the course of the PMT update.&lt;br /&gt;
&lt;br /&gt;
Broadly speaking, the shootdown approach includes the following two components:&lt;br /&gt;
&lt;br /&gt;
* A synchronization strategy across all processors during the PMT update. This orchestration begins by the initiating processor locking the page table (or some portion thereof). &lt;br /&gt;
* A message broadcasting mechanism that allows the initiating processor to notify other processors that their TLB cache may be out-of-date and to suspend their use of the TLB during the course of the update.&lt;br /&gt;
&lt;br /&gt;
This process is described in detail next followed by a discussion of implementations and their shootdown variants.&lt;br /&gt;
&lt;br /&gt;
# The initiating processor must first disable local interrupts, preventing another thread from preempting and modifying its TLB. It also clears its active flag for TLB usage. This flag is used to indicate whether a processor is currently using its TLB cache.&lt;br /&gt;
# The initiating processor locks the page table it is modifying or potentially some smaller portion of it. This prevents modification to the same page data by other processors.&lt;br /&gt;
# The shootdown method is primarily based on software that invalidates TLB entries. The initiating processor must place an inter-processor interrupt (IPI) request to its interrupt controller. The interrupt includes a data structure that references the location of the shootdown program and the target TLB entries that should be invalidated. The initiator waits until all processors have responded.&lt;br /&gt;
# The recipients of the interrupt respond by disabling their local inter-processor interrupts and using the shootdown program to invalidate either the affected TLB entries or the entire TLB, depending on the implementation. Each receiving processor then clears its TLB active flag and, after servicing the interrupt, enters a wait state. The wait state may be implemented by attempting to acquire the lock that the initiating processor holds on the page table or by polling shared memory.&lt;br /&gt;
# Once the interrupts have completed, the initiating processor makes changes to the PMTs and updates its own TLB cache. Before releasing the page lock, it sets its TLB active flag to true.&lt;br /&gt;
# The receiving processors exit their wait state, update their TLB caches, and set their TLB active flags.&lt;br /&gt;
&lt;br /&gt;
Variations in implementations may occur at various steps above:&lt;br /&gt;
&lt;br /&gt;
* Depending on the level of hardware support for TLB loading, an operating system may only be able to estimate which TLBs of other processors are affected by a PMT update. The estimation mechanisms are conservative and may result in false positives. &lt;br /&gt;
&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.huji.ac.il/~etsman/papers/DiDi-PACT-2011.pdf &amp;quot;Villavieja, Carlos, et al. DiDi: Mitigating The Performance Impact of TLB Shootdowns Using A Shared TLB Directory&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Hardware features of some environments allow for less synchronization. A modified TLB shootdown approach is implemented in IBM’s RP3 architecture that eliminates the need for victim processors to busy-wait while the PMT is being updated by the initiator. &lt;br /&gt;
&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Some editions of Windows kernels (including XP, PAE, and 2003 Enterprise Edition) attempt to batch updates to the PMT entries in order to minimize shootdowns and their associated performance dip. &lt;br /&gt;
&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://msdn.microsoft.com/en-us/windows/hardware/gg487512 &amp;quot;Operating Systems and PAE Support&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* The Intel Nehalem platform uses a hierarchical TLB.&lt;br /&gt;
&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.intel.com/pressroom/archive/reference/whitepaper_Nehalem.pdf &amp;quot;First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Hierarchical TLBs are analogous to a processor's L1 and L2 data caches. The L2 TLB may be inclusive of and potentially shared by multiple L1 TLBs. Shootdown may be used to maintain coherence at one or more TLB levels.&lt;br /&gt;
&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf &amp;quot;Address Translation for Manycore Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
TLB shootdown is used on the following platforms:&lt;br /&gt;
&lt;br /&gt;
* Carnegie Mellon University's Mach operating system which has been ported to BBN Butterfly, Encore's Multimax, IBM's RP3 and Sequent's Balance and Symmetry systems. &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Linux and Windows operating systems running on Intel and AMD chips.&lt;br /&gt;
&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://research.microsoft.com/pubs/101903/paper.pdf &amp;quot;Baumann, Andrew et al. The Multikernel: A new OS architecture for scalable multicore systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Intel 64 and IA-32 Architectures&lt;br /&gt;
&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://download.intel.com/design/processor/manuals/253668.pdf &amp;quot;Intel 64 and IA-32 Architectures Software Developer's Manual&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Instruction-based Invalidation'''&lt;br /&gt;
&lt;br /&gt;
Some processors have modified their instruction set to include a TLB invalidate instruction. The instruction is invoked by the operating system kernel to broadcast the modified page address from the initiating processor. The instruction is placed on the shared bus and a snooper on each processor detects the instruction and invalidates the appropriate TLB entries.&lt;br /&gt;
&lt;br /&gt;
Examples of this include the Intel Itanium architecture (via the ptc.g and ptc.ga instructions) and the tlbivax instruction on IBM’s Power series (for Power ISA 2.06).&lt;br /&gt;
&lt;br /&gt;
An example of an operating system that uses the Itanium support for instruction-based TLB invalidation is the Linux kernel.&lt;br /&gt;
&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.kernel.org/doc/ols/2003/ols2003-pages-76-88.pdf &amp;quot;Bryant, Ray and Hawkes, John. Linux Scalability for Large NUMA Systems&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Address Space Identifier (ASID) based Approach'''&lt;br /&gt;
&lt;br /&gt;
The MIPS family of processors tags each TLB entry with an identifier called an address space identifier (ASID). The ASID is similar to a process identifier (PID), except that it is not guaranteed to stay the same for the life of a process. Each process is assigned its own ASID on each processor, so a single PID may correspond to many ASIDs. A separate array keeps track of this mapping.&lt;br /&gt;
&lt;br /&gt;
When a process modifies a PTE in main memory, the kernel sets all of the process’ ASIDs to 0 on the other processors. Note: this does not modify the ASIDs in the TLB entries, only the array that maps a process to its ASID on each processor.&amp;lt;ref&amp;gt;Culler, David E. et al. ''Parallel Computer Architecture: A Hardware/Software Approach'', 1999, pp. 440-441.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the kernel finds that a process has a zero ASID on a processor, it assigns the process a new ASID. The new ASID does not match the stale TLB entries (which still carry the old ASID) on that processor, so those entries can no longer be used, which has the effect of invalidating them. The IRIX 5.2 operating system uses this mechanism on MIPS. ASIDs are likewise used on AMD64 processors with AMD-V, where the Xen hypervisor manages them. &lt;br /&gt;
&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.amd64.org/fileadmin/user_upload/pub/2007XenSummit-AMD-ASIDS-Biemueller.pdf &amp;quot;Biemueller, Sebastian. ASID Management in Xen AMD-V: Partioning the physical TLB with SVM ASIDs&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
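&lt;br /&gt;
The ASID mechanism described above can be sketched in a few lines of Python. This is a minimal illustration only, not the IRIX or MIPS implementation: the names Cpu, asid_of and change_mapping are hypothetical, the TLB is modelled as a dictionary, and the sketch ignores ASID exhaustion (real ASIDs are small and eventually wrap around).&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical model of ASID-based lazy TLB invalidation (all names are illustrative).

class Cpu:
    def __init__(self):
        self.next_asid = 1          # 0 is reserved to mean 'no valid ASID'
        self.asid_of = {}           # pid to ASID currently assigned on this CPU
        self.tlb = {}               # (asid, virtual page) to frame

    def asid_for(self, pid):
        # A zero (or missing) ASID means the process needs a fresh one.
        if self.asid_of.get(pid, 0) == 0:
            self.asid_of[pid] = self.next_asid
            self.next_asid += 1
        return self.asid_of[pid]

    def translate(self, pid, vpage, page_table):
        key = (self.asid_for(pid), vpage)
        if key not in self.tlb:     # TLB miss: reload from the page table
            self.tlb[key] = page_table[vpage]
        return self.tlb[key]


def change_mapping(page_table, vpage, frame, pid, other_cpus):
    # Update the PTE, then zero the process's ASID on the other CPUs.
    # The TLB entries themselves are left untouched.
    page_table[vpage] = frame
    for cpu in other_cpus:
        cpu.asid_of[pid] = 0


page_table = {7: 100}
cpu1 = Cpu()
before = cpu1.translate(42, 7, page_table)      # cpu1 caches page 7 as frame 100
change_mapping(page_table, 7, 200, 42, [cpu1])  # another CPU remaps page 7
after = cpu1.translate(42, 7, page_table)       # the new ASID misses and reloads frame 200
```
&lt;br /&gt;
In this sketch the stale entry stays in the dictionary under the old ASID, but it can never match again because the process now runs under a new ASID.&lt;br /&gt;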
&lt;br /&gt;
'''Validation Based Approach'''&lt;br /&gt;
&lt;br /&gt;
Under this approach, TLB invalidation is postponed until memory is accessed. This approach eliminates the need for interrupts and synchronization. Each frame in the page table is assigned a generation count. When a change modifies the mapping between a frame and a virtual page or when the permissions are restricted (e.g., from read-write to read-only), the memory manager modifies the generation count on the frame.&lt;br /&gt;
&lt;br /&gt;
Each TLB entry stores the generation count that was associated with the page when the TLB entry was created. Should a processor request memory at a virtual page with a stale generation count, the request is denied and the processor is notified to update its TLB entry.&lt;br /&gt;
&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://www.cs.uwaterloo.ca/~brecht/courses/702/Possible-Readings/multiprocessor/tlb-consistency-computer-1990.pdf &amp;quot;Teller, Patricia J. Translation-Lookaside Buffer Consistency&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
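&lt;br /&gt;
The generation-count check can be illustrated with a small Python model. It is a sketch under stated assumptions, not the scheme from the cited paper: the class names Memory and Tlb, the refills counter, and the choice to bump the count of both the old and the new frame on a remap are all hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical model of validation-based TLB consistency (names are illustrative).

class Memory:
    def __init__(self):
        self.page_table = {}        # virtual page to frame
        self.generation = {}        # frame to current generation count

    def map_page(self, vpage, frame):
        # A mapping change bumps the count of the old and the new frame.
        old = self.page_table.get(vpage)
        for f in (old, frame):
            if f is not None:
                self.generation[f] = self.generation.get(f, 0) + 1
        self.page_table[vpage] = frame

    def access(self, frame, generation):
        # The request is honoured only if the presented count is current.
        return self.generation[frame] == generation


class Tlb:
    def __init__(self, memory):
        self.memory = memory
        self.entries = {}           # virtual page to (frame, generation at fill time)
        self.refills = 0

    def fill(self, vpage):
        frame = self.memory.page_table[vpage]
        self.entries[vpage] = (frame, self.memory.generation[frame])
        self.refills += 1

    def load(self, vpage):
        if vpage not in self.entries:
            self.fill(vpage)
        frame, generation = self.entries[vpage]
        if not self.memory.access(frame, generation):
            self.fill(vpage)        # stale entry detected at access time
            frame, generation = self.entries[vpage]
        return frame


memory = Memory()
tlb = Tlb(memory)
memory.map_page(3, 10)
first = tlb.load(3)                 # fills the TLB with frame 10
memory.map_page(3, 11)              # remap: no interrupt is sent to the TLB
second = tlb.load(3)                # stale count is rejected, entry is refreshed
```
&lt;br /&gt;
No interrupt or synchronization is modelled anywhere: the stale entry is only discovered when it is next used.&lt;br /&gt;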
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy. UNITD is an example of a hardware-based unified protocol for cache and TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. The TLBs participate in cache coherence updates without requiring any change to that protocol. UNITD improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. The shootdown protocol used for TLB coherence, on the other hand, is essentially software based. A hardware-based TLB coherence protocol can a) significantly improve TLB coherence performance for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly affect application performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty of a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The actual latency is higher than the graphs show, because TLB invalidations cost extra cycles to repopulate the TLB, either from the cache or from main memory depending on the case. The latency from TLB shootdowns is also higher for applications that trigger shootdowns more often (refer to graph-2).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence'''&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems,Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture - A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta ]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
3. Culler, David E. et al. Parallel Computer Architecture: A Hardware/Software Approach, 1999, Gulf Professional Publishing&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59755</id>
		<title>CSC/ECE 506 Spring 2012/7b pk</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/7b_pk&amp;diff=59755"/>
		<updated>2012-03-19T00:05:25Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* TLB coherence through Shootdown */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''TLB (Translation Lookaside Buffer) Coherence in Multiprocessing''' &lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
== Background - Virtual Memory, Paging and TLB ==&lt;br /&gt;
In this section we introduce the basic terminology and the set-up where TLBs are used. &lt;br /&gt;
&lt;br /&gt;
A process running on a CPU has its own view of memory, its &amp;quot;'''Virtual Memory'''&amp;quot; (one single address space), which is mapped to actual physical memory. A virtual memory management scheme allows programs to exceed the size of the physical memory space. Virtual memory operates by executing programs that are only partially resident in memory, while relying on hardware and the operating system to bring the missing items into main memory when needed.&lt;br /&gt;
&lt;br /&gt;
'''Paging''' is a memory-management scheme that permits the physical address space of a process to be non-contiguous. The basic method involves breaking physical memory into fixed-size blocks called frames and breaking virtual memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from the backing storage (for example, a hard drive).&lt;br /&gt;
&lt;br /&gt;
Each process has a '''page table''' that maintains a mapping of virtual pages to physical pages. A page table is a software construct, managed by the operating system and kept in main memory. A pointer to the current process's page table is stored in a register (the '''Page-Table Base Register''', PTBR). Changing page tables requires changing only this one register, which substantially reduces context-switch time. (A page-table entry that is 4 bytes (32 bits) long can point to one of 2^32 physical page frames. If the frame size is 4 KB, a system with 4-byte entries can address 2^44 bytes, or 16 TB, of physical memory.)&lt;br /&gt;
&lt;br /&gt;
The hardware support for paging is shown in the figure. Every address is divided into two parts – a '''page number''' (p) and a '''page offset''' (d). The page number is used as an index into the page table. The page table contains the base address of each page in physical memory. The base address is combined with the page offset to define the physical memory address that is sent to the memory unit. The problem with this approach is the time required to access a user memory location. To access a location, we must first index into the page table (which requires a memory access using the PTBR and the page number) to find the frame number, and only then access the desired location in that frame. Two memory accesses are needed for each reference, so memory access is slowed by a factor of 2. &lt;br /&gt;
&lt;br /&gt;
The standard solution to this problem is to use a special, small, fast-lookup hardware cache called a '''Translation Look-Aside Buffer''' (TLB). The TLB is a fully associative, high-speed memory. Each entry in the TLB consists of two parts – a key (tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned. The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024. &lt;br /&gt;
&lt;br /&gt;
The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory. &lt;br /&gt;
&lt;br /&gt;
If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame is obtained, we can use it to access memory. In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random.&lt;br /&gt;
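&lt;br /&gt;
The lookup path described above (split the address, try the TLB, fall back to the page table on a miss, replace an entry when the TLB is full) can be sketched as follows. This is an illustrative model only: the Tlb class, the 4 KB page size, the LRU policy and the dictionary standing in for the page table are assumptions made for the example, not a description of any particular hardware.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict

PAGE_SIZE = 4096                    # assumed 4 KB pages

class Tlb:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()    # page number to frame number, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, address, page_table):
        # Split the address into page number (p) and page offset (d).
        page, offset = divmod(address, PAGE_SIZE)
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)          # mark as most recently used
        else:
            self.misses += 1                        # extra reference to the page table
            if len(self.entries) == self.capacity:
                self.entries.popitem(last=False)    # evict the least recently used entry
            self.entries[page] = page_table[page]
        return self.entries[page] * PAGE_SIZE + offset


page_table = {0: 5, 1: 9, 2: 3}     # hypothetical page to frame mapping
tlb = Tlb(capacity=2)
physical = tlb.translate(4100, page_table)  # page 1, offset 4: miss
tlb.translate(4200, page_table)             # page 1 again: hit
tlb.translate(0, page_table)                # page 0: miss
tlb.translate(8192, page_table)             # page 2: miss, evicts page 1
```
&lt;br /&gt;
Only the first access to each page pays for the page-table reference; the repeated access to page 1 is served from the TLB.&lt;br /&gt;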
&lt;br /&gt;
A page table typically has an additional bit per entry to indicate whether the page is read-only or read-write (the '''protection level'''). Another bit, the '''valid-invalid''' bit, indicates whether the page is part of that particular process's address space.&lt;br /&gt;
&lt;br /&gt;
'''Multilevel page tables''' are often used rather than a single page table per process. The first-level page table points to an entry in the next-level page table, and so on, until the mapping of virtual to physical page is obtained from the last-level page table.&lt;br /&gt;
&lt;br /&gt;
Note that caches typically use a &amp;quot;Virtually Indexed, Physically Tagged&amp;quot; organization that allows the L1 cache and the TLB to be looked up simultaneously. If the block is not found in the L1 cache, the address obtained from the TLB look-up is used to fetch the block. Simultaneous look-up of the L1 cache and the TLB makes the process efficient.[[Solihin]]&lt;br /&gt;
&lt;br /&gt;
== TLB coherence problem in multiprocessing ==&lt;br /&gt;
In multiprocessor systems with shared virtual memory and multiple Memory Management Units (MMUs), the mapping of a virtual address to a physical address can be cached in multiple TLBs simultaneously. (This happens because data is shared across processors and because processes migrate from one processor to another.) Since these mappings can change in response to changes of state (and of the related page in memory), multiple copies of a given mapping may pose a consistency problem. Keeping all copies of a mapping descriptor consistent with each other is referred to as the TLB coherence problem.&lt;br /&gt;
&lt;br /&gt;
A TLB entry changes because of the following events - &lt;br /&gt;
a) A page is swapped in or out (because of context change caused by an interrupt),&lt;br /&gt;
b) There is a TLB miss,&lt;br /&gt;
c) A page is referenced by a process for the first time,&lt;br /&gt;
d) A process terminates and TLB entries for it are no longer needed,&lt;br /&gt;
e) A protection change, e.g., from read to read-write,&lt;br /&gt;
f) Mapping changes.&lt;br /&gt;
&lt;br /&gt;
Of these changes, a) swap-outs, e) protection changes and f) mapping changes lead to the TLB coherence problem. A swap-out causes the corresponding virtual-to-physical page mapping to become invalid. (This is indicated by the valid/invalid bit in the page table.) If a TLB holds a mapping that corresponds to an invalidated Page Table Entry (PTE), that mapping must be flushed from all TLBs and must not be used. Furthermore, protection-level changes for a mapping must be observed by all processors, and a mapping modification (where the physical page for a virtual address changes) must also be seen coherently by all TLBs.&lt;br /&gt;
&lt;br /&gt;
Some architectures do not enforce TLB coherence for each datum (i.e., TLBs may contain different values). These architectures require TLB coherence only for unsafe changes made to address translations. Unsafe changes include a) mapping modifications, b) decreasing the page privileges (e.g., from read-write to read-only) and c) marking the translation as invalid. The remaining possible changes (e.g., d) increasing page privileges, e) updating the accessed/dirty bits) are considered safe and do not require TLB coherence. Consider one core that has a translation marked as read-only in its TLB, while another core updates the translation in the page table to be read-write. This translation update does not have to be immediately visible to the first core. Instead, the first core's TLB can be lazily updated when the core attempts to execute a store instruction; an access violation will occur and the page fault handler can load the updated translation.&lt;br /&gt;
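&lt;br /&gt;
The distinction between safe and unsafe changes can be written down as a small predicate. This is a sketch only: the function name needs_tlb_coherence and the dictionary layout of a translation (frame, valid, perms) are hypothetical, and accessed/dirty bits are not modelled because updating them is always safe.&lt;br /&gt;
&lt;br /&gt;
```python
def needs_tlb_coherence(old, new):
    # old and new describe one translation before and after a change.
    # Each is a dict with 'frame', 'valid' and 'perms' (a set such as {'r', 'w'}).
    if old['frame'] != new['frame']:
        return True                         # a) mapping modification
    if not new['perms'].issuperset(old['perms']):
        return True                         # b) page privileges decreased
    if old['valid'] and not new['valid']:
        return True                         # c) translation marked invalid
    return False                            # d) privileges increased: safe


read_only = {'frame': 8, 'valid': True, 'perms': {'r'}}
read_write = {'frame': 8, 'valid': True, 'perms': {'r', 'w'}}
swapped_out = {'frame': 8, 'valid': False, 'perms': {'r'}}
remapped = {'frame': 12, 'valid': True, 'perms': {'r'}}

upgrade = needs_tlb_coherence(read_only, read_write)      # safe, handled lazily
downgrade = needs_tlb_coherence(read_write, read_only)    # unsafe
invalidate = needs_tlb_coherence(read_only, swapped_out)  # unsafe
remap = needs_tlb_coherence(read_only, remapped)          # unsafe
```
&lt;br /&gt;
The upgrade case corresponds to the read-only to read-write example: a stale read-only entry can only cause a fault, which the page fault handler resolves.&lt;br /&gt;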
&lt;br /&gt;
A variety of solutions have been used for TLB coherence. Software solutions through operating systems are popular since TLB coherence operations are less frequent than cache coherence operations. The exact solution depends on whether PTEs (Page Table Entries) are loaded into TLBs directly by hardware or under software control. Hardware solutions are used where TLBs are not visible to software.&lt;br /&gt;
&lt;br /&gt;
In the next sections, we cover four approaches to TLB coherence: virtually addressed caches, software TLB shootdown, address space identifiers and hardware TLB coherence.&lt;br /&gt;
&lt;br /&gt;
We also review  UNified Instruction/Translation/Data (UNITD) Coherence, a hardware coherence framework that integrates TLB and cache coherence protocol.&lt;br /&gt;
&lt;br /&gt;
== TLB Coherence Approaches ==&lt;br /&gt;
&lt;br /&gt;
== TLB coherence through invalidation==&lt;br /&gt;
&lt;br /&gt;
== Other TLB coherence solutions ==&lt;br /&gt;
&lt;br /&gt;
== Unified Cache and TLB coherence solution ==&lt;br /&gt;
&lt;br /&gt;
This section discusses the salient features of the UNified Instruction/Translation/Data (UNITD) coherence protocol proposed by Romanescu, Lebeck, Sorin and Bracy. UNITD is an example of a hardware-based unified protocol for cache and TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Synopsis'''&lt;br /&gt;
&lt;br /&gt;
UNITD is a unified hardware coherence framework that integrates TLB coherence into the existing cache coherence protocol. The TLBs participate in cache coherence updates without requiring any change to that protocol. UNITD improves on the software-based shootdown approach and significantly reduces the performance penalty of TLB coherence.&lt;br /&gt;
&lt;br /&gt;
'''Background'''&lt;br /&gt;
&lt;br /&gt;
Cache coherence protocols are implemented entirely in hardware because this provides a) higher performance and b) decoupling from architectural issues. The shootdown protocol used for TLB coherence, on the other hand, is essentially software based. A hardware-based TLB coherence protocol can a) significantly improve TLB coherence performance for large numbers of processors and b) provide a cleaner interface to the operating system.&lt;br /&gt;
&lt;br /&gt;
TLB shootdowns are invisible to user applications, although they directly affect application performance. The performance impact depends on a) the position of the TLB in the memory hierarchy, b) the shootdown algorithm used and c) the number of processors. The position of the TLB in the memory hierarchy refers to whether the TLB is placed close to the processor or close to memory. Shootdown algorithms trade performance against complexity. The performance penalty of a shootdown increases with the number of processors, as can be seen from graph-1, which shows the latency for the processor issuing the shootdown and for the processors receiving it. The actual latency is higher than the graphs show, because TLB invalidations cost extra cycles to repopulate the TLB, either from the cache or from main memory depending on the case. The latency from TLB shootdowns is also higher for applications that trigger shootdowns more often (refer to graph-2).&lt;br /&gt;
&lt;br /&gt;
'''UNITD Coherence'''&lt;br /&gt;
&lt;br /&gt;
== Links ==&lt;br /&gt;
1. [http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project8_report_ver3.pdf Address Translation for Manycore Systems,Scott Beamer and Henry Cook, UC Berkeley]&lt;br /&gt;
&lt;br /&gt;
2. [http://people.ee.duke.edu/~sorin/papers/hpca10_unitd.pdf UNified Instruction Translation Data UNITD Coherence One Protocol to Rule Them All, Romanescu/Lebeck/Sorin/Bracy]&lt;br /&gt;
&lt;br /&gt;
3. [http://books.google.com/books?id=MHfHC4Wf3K0C&amp;amp;pg=PA440&amp;amp;lpg=PA440&amp;amp;dq=TLB+coherence+culler+singh&amp;amp;source=bl&amp;amp;ots=1KHOZe8GSO&amp;amp;sig=WurZA3GPe_82wSOi3JfqjEehVoI&amp;amp;hl=en&amp;amp;sa=X&amp;amp;ei=ujBmT9TnHKW80AGhkImvCA&amp;amp;ved=0CB8Q6AEwAA#v=onepage&amp;amp;q&amp;amp;f=false Parallel Computer Architecture - A Hardware/Software Approach - David E. Culler, Jaswinder Pal Singh, Anoop Gupta ]&lt;br /&gt;
&lt;br /&gt;
4. [http://ieeexplore.ieee.org Microprocessor Memory Management Units - Milan Milenkovic, IBM]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
1. Operating System Concepts - Silberschatz, Galvin, Gagne. Seventh Edition, Wiley publication&lt;br /&gt;
&lt;br /&gt;
2. Fundamentals of Parallel Computer Architecture, Yan Solihin&lt;br /&gt;
&lt;br /&gt;
== Quiz ==&lt;br /&gt;
&lt;br /&gt;
1. Page Table is maintained as a software construct&lt;br /&gt;
&lt;br /&gt;
a) True&lt;br /&gt;
b) False&lt;br /&gt;
&lt;br /&gt;
2. TLB keeps a mapping of ___ to ____&lt;br /&gt;
&lt;br /&gt;
a) Virtual Page to Physical Page&lt;br /&gt;
b) Virtual Address to Physical Address&lt;br /&gt;
c) Physical Page to Virtual Page&lt;br /&gt;
d) Physical Address to Virtual Address&lt;br /&gt;
&lt;br /&gt;
3. Currently TLB coherence is most often achieved through&lt;br /&gt;
&lt;br /&gt;
a) Software solutions&lt;br /&gt;
b) Hardware solutions&lt;br /&gt;
&lt;br /&gt;
4. UNITD is -&lt;br /&gt;
&lt;br /&gt;
a) A protocol for TLB coherence&lt;br /&gt;
b) A protocol for cache coherence&lt;br /&gt;
c) A protocol for cache and TLB coherence&lt;br /&gt;
&lt;br /&gt;
5. Which of the following is not used for TLB coherence-&lt;br /&gt;
&lt;br /&gt;
a) Virtual Cache address&lt;br /&gt;
b) Shootdown&lt;br /&gt;
c) Invalidation&lt;br /&gt;
d) Hardware solutions&lt;br /&gt;
e) MOESI&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/1b_ps&amp;diff=58215</id>
		<title>CSC/ECE 506 Spring 2012/1b ps</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/1b_ps&amp;diff=58215"/>
		<updated>2012-02-07T03:32:05Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Moore's Law ==&lt;br /&gt;
&lt;br /&gt;
In 1965, Intel co-founder Gordon Moore predicted that the number of [http://en.wikipedia.org/wiki/Transistor transistors] on a die would double at a regular interval, a figure now usually quoted as every 24 months. This was a rough predictive statement that has brought Moore acclaim for its reputed accuracy and foresight. This article explores different interpretations of Moore's law, whether it has indeed held true across the years in all significant intervals of time, and whether it will hold true in the future.&lt;br /&gt;
&lt;br /&gt;
The rate of growth Moore predicted is truly staggering. In equation form, the law is T(t) = T0 * 2^(t/2), where T0 is the transistor count in the starting year and T(t) is the transistor count t years later. Exponential growth is hard to appreciate in terms of transistors, so we'll switch to counting something more tangible: money. If one could double one's wealth every 24 months and started with $1, one would have $5,931,641 after 45 years!&lt;br /&gt;
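&lt;br /&gt;
The formula is easy to check numerically. The short Python sketch below evaluates it for the $1 example and for a starting point of 2,300 transistors in 1971 (the Intel 4004 figure used later in this article); the function name predicted_transistors is made up for the illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def predicted_transistors(t0, years):
    # T(t) = T0 * 2^(t/2): one doubling every two years.
    return t0 * 2 ** (years / 2)


wealth = predicted_transistors(1, 45)                   # $1 doubled every 24 months for 45 years
count_1972 = predicted_transistors(2300, 1972 - 1971)   # one year after the Intel 4004
count_2011 = predicted_transistors(2300, 2011 - 1971)   # forty years after the Intel 4004

# Percent error of an actual count against the prediction,
# here for a chip with 2.6 billion transistors in 2011.
error_2011 = 100 * (2600000000 - count_2011) / count_2011

print(int(wealth), round(count_1972), int(count_2011), round(error_2011))
```
&lt;br /&gt;
Forty years at one doubling every two years is twenty doublings, which is why the 2011 prediction is 2,300 multiplied by 2^20.&lt;br /&gt;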
&lt;br /&gt;
But this is exactly what Intel claims it has done with the transistor density of their processors (and certainly their bottom line). To a large extent, the company has symbiotically harnessed and fueled the public awareness of Moore's law to its advantage as a marketing device, but it is still valuable to study the context of the prediction (the 60s, the advent of semi-conductor technology and miniaturization) and the reasons Moore believed the prediction would hold and whether he was right.&lt;br /&gt;
&lt;br /&gt;
== Historic Context ==&lt;br /&gt;
&lt;br /&gt;
According to David C. Brock in Understanding Moore's Law: Four Decades of Innovation:&lt;br /&gt;
&lt;br /&gt;
&amp;quot;By 1960 miniaturization was a fundamental issue for semiconductor technology and its industry. It had become, moreover, a central factor in the semiconductor community's discussions surrounding the new integrated circuits that had been touted in 1959 by Texas Instruments as the first realization of the &amp;quot;monolithic&amp;quot; circuit ideal.&amp;quot;&amp;lt;ref&amp;gt;Brock, David C. ''Understanding Moore's Law: Four Decades of Innovation''. Chemical Heritage Foundation, 2006, p. 26.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Two ideas are most striking about this statement: the focus on miniaturization and the concept of the &amp;quot;monolithic&amp;quot; circuit ideal.&lt;br /&gt;
&lt;br /&gt;
In the early 1960s, the semiconductor industry was gravitating towards building [http://en.wikipedia.org/wiki/Integrated_circuit integrated circuits] on wafers of silicon rather than using discrete transistors as components in electronic devices. The arguments presented were largely based on cost reduction. Integrated or monolithic circuits were cheaper to produce than devices based on connected discrete transistors.&lt;br /&gt;
&lt;br /&gt;
Among the seminal presentations that predate Moore's publication of his law is one by C. Harry Knowles (manager of Westinghouse's molecular electronics division). Knowles addressed two critical ideas that help put Moore's Law into proper context. &lt;br /&gt;
First, Knowles argued that with technological progress, manufacturers could produce integrated circuits with greater functionality and complexity at higher [http://en.wikipedia.org/wiki/Yield yields] (a measure of output) (Brock). Second, Knowles brought attention to the issue of performance as a function of size.&lt;br /&gt;
&lt;br /&gt;
== Transistor Counts Over The Years ==&lt;br /&gt;
&lt;br /&gt;
The stage was set: smaller was faster, and miniaturized, integrated circuits were cheaper. Moore built on Knowles's ideas in several ways in his presentation of Moore's Law, including a significant simplification of those ideas to transistor counts over time. &lt;br /&gt;
&lt;br /&gt;
If Moore was right, he wasn't right on the first try. His initial prediction was that transistor counts would double every year. In 1975 he revised this prediction to every two years &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://news.cnet.com/Myths-of-Moores-Law/2010-1071_3-1014887.html &amp;quot;Kanellos, Michael. Perspective: Myths of Moore's Law&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If we take 1971 and the Intel 4004 processor as our starting point &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://en.wikipedia.org/wiki/Transistor_count &amp;quot;Microprocessors&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;, we see the following actual growth alongside the predicted growth, with the variance (actual minus predicted) and the percent error next to each processor.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Processor'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Transistor count'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Year'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Predicted Value'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Variance'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Percent Error (Variance/Predicted Transistor Count)'''&lt;br /&gt;
|-&lt;br /&gt;
| Intel 4004||2,300||1971||2,300||N/A||&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8008||3,500||1972||3,253||247||8%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 6800||4,100||1974||6,505||-2,405||-37%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8080||4,500||1974||6,505||-2,005||-31%&lt;br /&gt;
|-&lt;br /&gt;
| RCA 1802||5,000||1974||6,505||-1,505||-23%&lt;br /&gt;
|-&lt;br /&gt;
| MOS Technology 6502||3,510||1975||9,200||-5,690||-62%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8085||6,500||1976||13,011||-6,511||-50%&lt;br /&gt;
|-&lt;br /&gt;
| Zilog Z80||8,500||1976||13,011||-4,511||-35%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 6809||9,000||1978||26,022||-17,022||-65%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8086||29,000||1978||26,022||2,978||11%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8088||29,000||1979||36,800||-7,800||-21%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 68000||68,000||1979||36,800||31,200||85%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80186||55,000||1982||104,086||-49,086||-47%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80286||134,000||1982||104,086||29,914||29%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80386||275,000||1985||294,400||-19,400||-7%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80486||1,180,000||1989||1,177,600||2,400||0%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium||3,100,000||1993||4,710,400||-1,610,400||-34%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K5||4,300,000||1996||13,323,023||-9,023,023||-68%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium II||7,500,000||1997||18,841,600||-11,341,600||-60%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6||8,800,000||1997||18,841,600||-10,041,600||-53%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium III||9,500,000||1999||37,683,200||-28,183,200||-75%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6-III||21,300,000||1999||37,683,200||-16,383,200||-43%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K7||22,000,000||1999||37,683,200||-15,683,200||-42%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium 4||42,000,000||2000||53,292,093||-11,292,093||-21%&lt;br /&gt;
|-&lt;br /&gt;
| Barton||54,300,000||2003||150,732,800||-96,432,800||-64%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K8||105,900,000||2003||150,732,800||-44,832,800||-30%&lt;br /&gt;
|-&lt;br /&gt;
| Itanium 2||220,000,000||2003||150,732,800||69,267,200||46%&lt;br /&gt;
|-&lt;br /&gt;
| Itanium 2 with 9MB cache||592,000,000||2004||213,168,370||378,831,630||178%&lt;br /&gt;
|-&lt;br /&gt;
| Cell||241,000,000||2006||426,336,740||-185,336,740||-43%&lt;br /&gt;
|-&lt;br /&gt;
| Core 2 Duo||291,000,000||2006||426,336,740||-135,336,740||-32%&lt;br /&gt;
|-&lt;br /&gt;
| Dual-Core Itanium 2||1,700,000,000||2006||426,336,740||1,273,663,260||299%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K10||463,000,000||2007||602,931,200||-139,931,200||-23%&lt;br /&gt;
|-&lt;br /&gt;
| POWER6||789,000,000||2007||602,931,200||186,068,800||31%&lt;br /&gt;
|-&lt;br /&gt;
| Atom||47,000,000||2008||852,673,480||-805,673,480||-94%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K10||758,000,000||2008||852,673,480||-94,673,480||-11%&lt;br /&gt;
|-&lt;br /&gt;
| Core i7 (Quad)||731,000,000||2008||852,673,480||-121,673,480||-14%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Xeon 7400||1,900,000,000||2008||852,673,480||1,047,326,520||123%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Opteron 2400||904,000,000||2009||1,205,862,400||-301,862,400||-25%&lt;br /&gt;
|-&lt;br /&gt;
| 16-Core SPARC T3||1,000,000,000||2010||1,705,346,960||-705,346,960||-41%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Core i7 (Gulftown)||1,170,000,000||2010||1,705,346,960||-535,346,960||-31%&lt;br /&gt;
|-&lt;br /&gt;
| 8-core POWER7||1,200,000,000||2010||1,705,346,960||-505,346,960||-30%&lt;br /&gt;
|-&lt;br /&gt;
| Quad-core z196||1,400,000,000||2010||1,705,346,960||-305,346,960||-18%&lt;br /&gt;
|-&lt;br /&gt;
| Quad-Core Itanium Tukwila||2,000,000,000||2010||1,705,346,960||294,653,040||17%&lt;br /&gt;
|-&lt;br /&gt;
| 8-Core Xeon Nehalem-EX||2,300,000,000||2010||1,705,346,960||594,653,040||35%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Core i7 (Sandy Bridge-E)||2,270,000,000||2011||2,411,724,800||-141,724,800||-6%&lt;br /&gt;
|-&lt;br /&gt;
| 10-Core Xeon Westmere-EX||2,600,000,000||2011||2,411,724,800||188,275,200||8%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Data Review ==&lt;br /&gt;
&lt;br /&gt;
So it's quite clear that the &amp;quot;law&amp;quot; is not to be taken too literally. It's a general marker for what to expect in upcoming years. In several years the count jumps to catch up with the prediction (by 2011, the variance is only 8% with the 10-Core Xeon chip). The Dual-Core Itanium 2 also lurches ahead of the prediction by nearly 300%.&lt;br /&gt;
&lt;br /&gt;
The following chart shows the actual and predicted transistor counts:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:TransistorCountVariance.JPG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Variance Years ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Some processors are probably exceptions that should be taken out of the data. The Atom, for example, is an ultra-low-voltage processor embedded in netbooks. Such special-purpose processors do not reflect the state of the technology at the time: the Intel Core i7 chip, released in 2008, the same year as the Atom, is much closer to the mark at a variance of only -14%, versus the Atom's -94%.&lt;br /&gt;
&lt;br /&gt;
So when evaluating whether Moore's law has failed to hold, it is important to compare chips of the same family, or perhaps to take the best technology for a given year. For example, the Motorola 68000 (1979), Intel 80286 (1982), Intel 80386 (1985) and Intel 80486 (1989), the best chips for those years in the data above, either exceed Moore's predictions or come very close.&lt;br /&gt;
&lt;br /&gt;
But there are many intervals where this is not true, including 1971-1978 and 1993-2000, in which most chips fall well short of the prediction. One explanation is that other chip factors started to become more important during this time (power consumption, for example). One also sees certain chips and chip families that cause transistor counts to lurch forward (for example, the Dual-Core Itanium 2, with 1.7 billion transistors, roughly 4 times Moore's prediction for 2006). This lurching forward may mean that, in subsequent years, the hardware outstrips market needs and it takes some time for operating systems and software to catch up. During this period, there is less demand for more powerful chips, so market forces favor the production of less expensive chips with lower transistor counts.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Moore's Law - The Future ==&lt;br /&gt;
&lt;br /&gt;
With a good survey of data and historic context firmly in view, we can now consider whether Moore's Law will hold in future years. Clearly, Moore's predictions concerned transistor counts, not processor speed. But like Knowles, Moore was searching for a simple metric of complexity and performance with respect to cost. His ultimate aim was to predict future performance at costs that were feasible in the marketplace. Hence one can take Moore's Law to apply to performance rather than to the specific metric of transistor counts.&lt;br /&gt;
&lt;br /&gt;
Performance is a function of many factors besides transistor counts, including memory management and overall processor architecture (pipelining, cache levels, etc.). Still, in the context of miniaturization and monolithic architectures, it is possible that Moore underestimated the importance of these factors despite his ultimate interest in the sustained future improvement in performance with respect to cost. We must also consider that, as with any new technology, the semiconductor industry was struggling to convince the engineering community and public of the importance of integrated circuits. The focus was deliberate and simple: adopt integrated circuits as the future.&lt;br /&gt;
&lt;br /&gt;
Those who argue that transistor counts will eventually hit a wall typically do so on the basis of physical limitations. The arguments seem reasonable given that Moore's prediction concerned the number of transistors on a single die and the general unsustainability of exponential growth. Even if the physical limitations were surmounted by better materials and semi-conductor manufacturing processes, it is likely their benefits will be outweighed by alternative innovations at other levels.&lt;br /&gt;
&lt;br /&gt;
The exponential growth predicted by Moore's law is ultimately not sustainable, even as a rough guideline. Even if it were, and the above physical limitations were overcome, the resulting transistor counts or performance improvements would likely outrun Moore's predictions significantly, since the technology to work with materials at an atomic level would likely lead to completely different architectures that Moore could never have predicted.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
*Brock, David C. (2006). ''Understanding Moore's Law: Four Decades of Innovation''. Chemical Heritage Foundation.&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/1b_ps&amp;diff=58214</id>
		<title>CSC/ECE 506 Spring 2012/1b ps</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/1b_ps&amp;diff=58214"/>
		<updated>2012-02-07T03:31:32Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Moore's Law ==&lt;br /&gt;
&lt;br /&gt;
In 1965, Gordon Moore, who went on to co-found Intel, predicted that the number of [http://en.wikipedia.org/wiki/Transistor transistors] on a die would double at a regular interval, a rate now usually quoted as every 24 months. This rough predictive statement has brought Moore acclaim for its reputed accuracy and foresight. This article explores different interpretations of Moore's law, whether it has indeed held true across all significant intervals of time, and whether it will hold true in the future.&lt;br /&gt;
&lt;br /&gt;
The rate of growth Moore predicted is truly staggering. The equation form of the law is T(t) = T0 * 2^(t/2), where T0 is the initial transistor count in the start year and T(t) is the transistor count t years later. Exponential growth is hard to appreciate in transistors, so we'll switch to counting something more tangible: money. If one could double one's wealth every 24 months and started with $1, one would have $5,931,641 in 45 years!&lt;br /&gt;
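&lt;br /&gt;
The arithmetic above can be checked with a minimal Python sketch. The function name predicted_count and its doubling_period parameter are illustrative assumptions, not part of any source cited here; the formula itself is the one stated above.&lt;br /&gt;
&lt;br /&gt;
```python
def predicted_count(t0, years, doubling_period=2.0):
    """Return T(t) = T0 * 2^(t / doubling_period).

    t0 is the starting quantity and years is the elapsed time t.
    The name and the doubling_period default are assumptions made
    for this illustration.
    """
    return t0 * 2 ** (years / doubling_period)


# One dollar doubled every 24 months for 45 years (truncated to whole dollars).
print(int(predicted_count(1, 45)))

# The Intel 4004's 2,300 transistors projected one year ahead.
print(round(predicted_count(2300, 1)))
```
&lt;br /&gt;
The first call reproduces the dollar figure quoted above, and the second gives the value Moore's law predicts for 1972 when starting from the 2,300 transistors of the Intel 4004 in 1971.&lt;br /&gt;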
&lt;br /&gt;
But this is exactly what Intel claims to have done with the transistor density of its processors (and certainly its bottom line). To a large extent, the company has both harnessed and fueled public awareness of Moore's law as a marketing device. It is still valuable to study the context of the prediction (the 1960s, the advent of semiconductor technology and miniaturization), the reasons Moore believed the prediction would hold, and whether he was right.&lt;br /&gt;
&lt;br /&gt;
== Historic Context ==&lt;br /&gt;
&lt;br /&gt;
According to David C. Brock in Understanding Moore's Law: Four Decades of Innovation:&lt;br /&gt;
&lt;br /&gt;
&amp;quot;By 1960 miniaturization was a fundamental issue for semiconductor technology and its industry. It had become, moreover, a central factor in the semiconductor community's discussions surrounding the new integrated circuits that had been touted in 1959 by Texas Instruments as the first realization of the &amp;quot;monolithic&amp;quot; circuit ideal.&amp;quot;&amp;lt;ref&amp;gt;Brock, David C. ''Understanding Moore's Law: Four Decades of Innovation''. Chemical Heritage Foundation, 2006, p. 26.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Two ideas are most striking about this statement: the focus on miniaturization and the concept of the &amp;quot;monolithic&amp;quot; circuit ideal.&lt;br /&gt;
&lt;br /&gt;
In the early 1960s, the semiconductor industry was gravitating towards building [http://en.wikipedia.org/wiki/Integrated_circuit integrated circuits] on wafers of silicon, rather than using discrete transistors, as components in electronic devices. The arguments presented were largely based on cost reduction: integrated, or monolithic, circuits were cheaper to produce than devices based on connected discrete transistors.&lt;br /&gt;
&lt;br /&gt;
Among the seminal presentations of the time, predating Moore's publication of Moore's Law, is that of C. Harry Knowles (manager of Westinghouse's molecular electronics division). Knowles addressed two critical ideas that help put Moore's Law into proper context. &lt;br /&gt;
First, Knowles argued that with technological progress, the industry could produce integrated circuits with greater functionality and complexity at higher [http://en.wikipedia.org/wiki/Yield yields] (a measure of output) (Brock). Secondly, Knowles brought attention to the issue of performance as a function of size.&lt;br /&gt;
&lt;br /&gt;
== Transistor Counts Over The Years ==&lt;br /&gt;
&lt;br /&gt;
The stage was set: smaller was faster, and miniaturized, integrated circuits were cheaper. In presenting his law, Moore built on Knowles's ideas in several ways, including a significant simplification of those ideas to transistor counts over time. &lt;br /&gt;
&lt;br /&gt;
If Moore was right, he wasn't right on the first try. His initial prediction was that transistor counts would double every year. In 1975 he revised this prediction to every two years.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://news.cnet.com/Myths-of-Moores-Law/2010-1071_3-1014887.html &amp;quot;Kanellos, Michael. Perspective: Myths of Moore's Law&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If we take 1971 and the Intel 4004 processor as our starting point &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://en.wikipedia.org/wiki/Transistor_count &amp;quot;Microprocessors&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;, the table below lists each processor's actual transistor count alongside the predicted count, the variance (actual minus predicted), and the variance as a percentage of the prediction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Processor'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Transistor count'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Year'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Predicted Value'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Variance'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Percent Error (Variance/Predicted Transistor Count)'''&lt;br /&gt;
|-&lt;br /&gt;
| Intel 4004||2,300||1971||2,300||N/A||&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8008||3,500||1972||3,253||247||8%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 6800||4,100||1974||6,505||-2,405||-37%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8080||4,500||1974||6,505||-2,005||-31%&lt;br /&gt;
|-&lt;br /&gt;
| RCA 1802||5,000||1974||6,505||-1,505||-23%&lt;br /&gt;
|-&lt;br /&gt;
| MOS Technology 6502||3,510||1975||9,200||-5,690||-62%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8085||6,500||1976||13,011||-6,511||-50%&lt;br /&gt;
|-&lt;br /&gt;
| Zilog Z80||8,500||1976||13,011||-4,511||-35%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 6809||9,000||1978||26,022||-17,022||-65%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8086||29,000||1978||26,022||2,978||11%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8088||29,000||1979||36,800||-7,800||-21%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 68000||68,000||1979||36,800||31,200||85%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80186||55,000||1982||104,086||-49,086||-47%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80286||134,000||1982||104,086||29,914||29%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80386||275,000||1985||294,400||-19,400||-7%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80486||1,180,000||1989||1,177,600||2,400||0%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium||3,100,000||1993||4,710,400||-1,610,400||-34%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K5||4,300,000||1996||13,323,023||-9,023,023||-68%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium II||7,500,000||1997||18,841,600||-11,341,600||-60%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6||8,800,000||1997||18,841,600||-10,041,600||-53%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium III||9,500,000||1999||37,683,200||-28,183,200||-75%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6-III||21,300,000||1999||37,683,200||-16,383,200||-43%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K7||22,000,000||1999||37,683,200||-15,683,200||-42%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium 4||42,000,000||2000||53,292,093||-11,292,093||-21%&lt;br /&gt;
|-&lt;br /&gt;
| Barton||54,300,000||2003||150,732,800||-96,432,800||-64%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K8||105,900,000||2003||150,732,800||-44,832,800||-30%&lt;br /&gt;
|-&lt;br /&gt;
| Itanium 2||220,000,000||2003||150,732,800||69,267,200||46%&lt;br /&gt;
|-&lt;br /&gt;
| Itanium 2 with 9MB cache||592,000,000||2004||213,168,370||378,831,630||178%&lt;br /&gt;
|-&lt;br /&gt;
| Cell||241,000,000||2006||426,336,740||-185,336,740||-43%&lt;br /&gt;
|-&lt;br /&gt;
| Core 2 Duo||291,000,000||2006||426,336,740||-135,336,740||-32%&lt;br /&gt;
|-&lt;br /&gt;
| Dual-Core Itanium 2||1,700,000,000||2006||426,336,740||1,273,663,260||299%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K10||463,000,000||2007||602,931,200||-139,931,200||-23%&lt;br /&gt;
|-&lt;br /&gt;
| POWER6||789,000,000||2007||602,931,200||186,068,800||31%&lt;br /&gt;
|-&lt;br /&gt;
| Atom||47,000,000||2008||852,673,480||-805,673,480||-94%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K10||758,000,000||2008||852,673,480||-94,673,480||-11%&lt;br /&gt;
|-&lt;br /&gt;
| Core i7 (Quad)||731,000,000||2008||852,673,480||-121,673,480||-14%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Xeon 7400||1,900,000,000||2008||852,673,480||1,047,326,520||123%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Opteron 2400||904,000,000||2009||1,205,862,400||-301,862,400||-25%&lt;br /&gt;
|-&lt;br /&gt;
| 16-Core SPARC T3||1,000,000,000||2010||1,705,346,960||-705,346,960||-41%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Core i7 (Gulftown)||1,170,000,000||2010||1,705,346,960||-535,346,960||-31%&lt;br /&gt;
|-&lt;br /&gt;
| 8-core POWER7||1,200,000,000||2010||1,705,346,960||-505,346,960||-30%&lt;br /&gt;
|-&lt;br /&gt;
| Quad-core z196||1,400,000,000||2010||1,705,346,960||-305,346,960||-18%&lt;br /&gt;
|-&lt;br /&gt;
| Quad-Core Itanium Tukwila||2,000,000,000||2010||1,705,346,960||294,653,040||17%&lt;br /&gt;
|-&lt;br /&gt;
| 8-Core Xeon Nehalem-EX||2,300,000,000||2010||1,705,346,960||594,653,040||35%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Core i7 (Sandy Bridge-E)||2,270,000,000||2011||2,411,724,800||-141,724,800||-6%&lt;br /&gt;
|-&lt;br /&gt;
| 10-Core Xeon Westmere-EX||2,600,000,000||2011||2,411,724,800||188,275,200||8%&lt;br /&gt;
|}&lt;br /&gt;
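&lt;br /&gt;
The three derived columns of the table can be reproduced with a short Python sketch. It assumes the 1971 Intel 4004 baseline of 2,300 transistors and a two-year doubling period, as described above; the names BASE_YEAR, BASE_COUNT and table_row are hypothetical and chosen only for this illustration.&lt;br /&gt;
&lt;br /&gt;
```python
BASE_YEAR = 1971   # starting point: Intel 4004
BASE_COUNT = 2300  # transistor count of the Intel 4004


def table_row(actual, year):
    """Return (predicted value, variance, percent error) for one chip.

    Variance is the actual count minus the predicted count, and the
    percent error is the variance divided by the predicted count.
    Each value is rounded to the nearest whole number for display.
    """
    predicted = BASE_COUNT * 2 ** ((year - BASE_YEAR) / 2)
    variance = actual - predicted
    percent = 100 * variance / predicted
    return round(predicted), round(variance), round(percent)


print(table_row(29000, 1978))    # Intel 8086
print(table_row(1180000, 1989))  # Intel 80486
```
&lt;br /&gt;
For the Intel 8086 and Intel 80486, the results agree with the corresponding rows of the table.&lt;br /&gt;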
&lt;br /&gt;
== Data Review ==&lt;br /&gt;
&lt;br /&gt;
So it's quite clear that the &amp;quot;law&amp;quot; is not to be taken too literally. It's a general marker for what to expect in upcoming years. In several years the count actually jumps to catch up with the prediction (by 2011, the variance is only 8% with the 10-Core Xeon chip). The Dual-Core Itanium 2 also lurches forward, to nearly 300% above the prediction for 2006.&lt;br /&gt;
&lt;br /&gt;
The following chart shows the actual and predicted transistor counts:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:TransistorCountVariance.JPG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Variance Years ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Some processors are probably exceptions that should be taken out of the data. The Atom, for example, is an ultra-low-voltage processor embedded in netbooks. Such special-purpose processors do not reflect the state of the technology at the time: the Intel Core i7 chip, released in 2008, the same year as the Atom, is much closer to the mark at a variance of only -14%, versus the Atom's -94%.&lt;br /&gt;
&lt;br /&gt;
So when evaluating whether Moore's law has failed to hold, it is important to compare chips of the same family, or perhaps to take the best technology for a given year. For example, the Motorola 68000 (1979), Intel 80286 (1982), Intel 80386 (1985) and Intel 80486 (1989), the best chips for those years in the data above, either exceed Moore's predictions or come very close.&lt;br /&gt;
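&lt;br /&gt;
One way to &amp;quot;take the best technology for a given year&amp;quot; is to keep only the chip with the highest transistor count in each year. The Python sketch below does this for four rows copied from the table above; the function name best_per_year and the tuple layout are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# (processor, transistor count, year), copied from the table above.
chips = [
    ('Intel 8088', 29000, 1979),
    ('Motorola 68000', 68000, 1979),
    ('Intel 80186', 55000, 1982),
    ('Intel 80286', 134000, 1982),
]


def best_per_year(rows):
    """Map each year to the (processor, count) pair with the largest count."""
    best = {}
    for name, count, year in rows:
        current = best.get(year, (name, count))
        # max() with a key compares the pairs by transistor count only.
        best[year] = max(current, (name, count), key=lambda chip: chip[1])
    return best


for year, (name, count) in sorted(best_per_year(chips).items()):
    print(year, name, count)
```
&lt;br /&gt;
With these rows, the Motorola 68000 is selected for 1979 and the Intel 80286 for 1982, the same chips named in the paragraph above.&lt;br /&gt;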
&lt;br /&gt;
But there are many intervals where this is not true, including 1971-1978 and 1993-2000, in which most chips fall well short of the prediction. One explanation is that other chip factors started to become more important during this time (power consumption, for example). One also sees certain chips and chip families that cause transistor counts to lurch forward (for example, the Dual-Core Itanium 2, with 1.7 billion transistors, roughly 4 times Moore's prediction for 2006). This lurching forward may mean that, in subsequent years, the hardware outstrips market needs and it takes some time for operating systems and software to catch up. During this period, there is less demand for more powerful chips, so market forces favor the production of less expensive chips with lower transistor counts.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Moore's Law - The Future ==&lt;br /&gt;
&lt;br /&gt;
With a good survey of data and historic context firmly in view, we can now consider whether Moore's Law will hold in future years. Clearly, Moore's predictions concerned transistor counts, not processor speed. But like Knowles, Moore was searching for a simple metric of complexity and performance with respect to cost. His ultimate aim was to predict future performance at costs that were feasible in the marketplace. Hence one can take Moore's Law to apply to performance rather than to the specific metric of transistor counts.&lt;br /&gt;
&lt;br /&gt;
Performance is a function of many factors besides transistor counts, including memory management and overall processor architecture (pipelining, cache levels, etc.). Still, in the context of miniaturization and monolithic architectures, it is possible that Moore underestimated the importance of these factors despite his ultimate interest in the sustained future improvement in performance with respect to cost. We must also consider that, as with any new technology, the semiconductor industry was struggling to convince the engineering community and public of the importance of integrated circuits. The focus was deliberate and simple: adopt integrated circuits as the future.&lt;br /&gt;
&lt;br /&gt;
Those who argue that transistor counts will eventually hit a wall typically do so on the basis of physical limitations. The arguments seem reasonable given that Moore's prediction concerned the number of transistors on a single die and the general unsustainability of exponential growth. Even if the physical limitations were surmounted by better materials and semi-conductor manufacturing processes, it is likely their benefits will be outweighed by alternative innovations at other levels.&lt;br /&gt;
&lt;br /&gt;
The exponential growth predicted by Moore's law is ultimately not sustainable, even as a rough guideline. Even if it were, and the above physical limitations were overcome, the resulting transistor counts or performance improvements would likely outrun Moore's predictions significantly, since the technology to work with materials at an atomic level would likely lead to completely different architectures that Moore could never have predicted.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
*Brock, David C. (2006). ''Understanding Moore's Law: Four Decades of Innovation''. Chemical Heritage Foundation.&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/1b_ps&amp;diff=58212</id>
		<title>CSC/ECE 506 Spring 2012/1b ps</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/1b_ps&amp;diff=58212"/>
		<updated>2012-02-07T03:30:27Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Moore's Law */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Moore's Law ==&lt;br /&gt;
&lt;br /&gt;
In 1965, Gordon Moore, who went on to co-found Intel, predicted that the number of [http://en.wikipedia.org/wiki/Transistor transistors] on a die would double at a regular interval, a rate now usually quoted as every 24 months. This rough predictive statement has brought Moore acclaim for its reputed accuracy and foresight. This article explores different interpretations of Moore's law, whether it has indeed held true across all significant intervals of time, and whether it will hold true in the future.&lt;br /&gt;
&lt;br /&gt;
The rate of growth Moore predicted is truly staggering. The equation form of the law is T(t) = T0 * 2^(t/2), where T0 is the initial transistor count in the start year and T(t) is the transistor count t years later. Exponential growth is hard to appreciate in transistors, so we'll switch to counting something more tangible: money. If one could double one's wealth every 24 months and started with $1, one would have $5,931,641 in 45 years!&lt;br /&gt;
&lt;br /&gt;
But this is exactly what Intel claims to have done with the transistor density of its processors (and certainly its bottom line). To a large extent, the company has both harnessed and fueled public awareness of Moore's law as a marketing device. It is still valuable to study the context of the prediction (the 1960s, the advent of semiconductor technology and miniaturization), the reasons Moore believed the prediction would hold, and whether he was right.&lt;br /&gt;
&lt;br /&gt;
== Historic Context ==&lt;br /&gt;
&lt;br /&gt;
According to David C. Brock in Understanding Moore's Law: Four Decades of Innovation:&lt;br /&gt;
&lt;br /&gt;
&amp;quot;By 1960 miniaturization was a fundamental issue for semiconductor technology and its industry. It had become, moreover, a central factor in the semiconductor community's discussions surrounding the new integrated circuits that had been touted in 1959 by Texas Instruments as the first realization of the &amp;quot;monolithic&amp;quot; circuit ideal.&amp;quot;&amp;lt;ref&amp;gt;Brock, David C. ''Understanding Moore's Law: Four Decades of Innovation''. Chemical Heritage Foundation, 2006, p. 26.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Two ideas are most striking about this statement: the focus on miniaturization and the concept of the &amp;quot;monolithic&amp;quot; circuit ideal.&lt;br /&gt;
&lt;br /&gt;
In the early 1960s, the semiconductor industry was gravitating towards building [http://en.wikipedia.org/wiki/Integrated_circuit integrated circuits] on wafers of silicon, rather than using discrete transistors, as components in electronic devices. The arguments presented were largely based on cost reduction: integrated, or monolithic, circuits were cheaper to produce than devices based on connected discrete transistors.&lt;br /&gt;
&lt;br /&gt;
Among the seminal presentations of the time, predating Moore's publication of Moore's Law, is that of C. Harry Knowles (manager of Westinghouse's molecular electronics division). Knowles addressed two critical ideas that help put Moore's Law into proper context. &lt;br /&gt;
First, Knowles argued that with technological progress, the industry could produce integrated circuits with greater functionality and complexity at higher [http://en.wikipedia.org/wiki/Yield yields] (a measure of output) (Brock). Secondly, Knowles brought attention to the issue of performance as a function of size.&lt;br /&gt;
&lt;br /&gt;
== Transistor Counts Over The Years ==&lt;br /&gt;
&lt;br /&gt;
The stage was set: smaller was faster, and miniaturized, integrated circuits were cheaper. In presenting his law, Moore built on Knowles's ideas in several ways, including a significant simplification of those ideas to transistor counts over time. &lt;br /&gt;
&lt;br /&gt;
If Moore was right, he wasn't right on the first try. His initial prediction was that transistor counts would double every year. In 1975 he revised this prediction to every two years.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://news.cnet.com/Myths-of-Moores-Law/2010-1071_3-1014887.html &amp;quot;Kanellos, Michael. Perspective: Myths of Moore's Law&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If we take 1971 and the Intel 4004 processor as our starting point &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://en.wikipedia.org/wiki/Transistor_count &amp;quot;Microprocessors&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;, the table below lists each processor's actual transistor count alongside the predicted count, the variance (actual minus predicted), and the variance as a percentage of the prediction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Processor'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Transistor count'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Year'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Predicted Value'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Variance'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Percent Error (Variance/Predicted Transistor Count)'''&lt;br /&gt;
|-&lt;br /&gt;
| Intel 4004||2,300||1971||2,300||N/A||&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8008||3,500||1972||3,253||247||8%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 6800||4,100||1974||6,505||-2,405||-37%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8080||4,500||1974||6,505||-2,005||-31%&lt;br /&gt;
|-&lt;br /&gt;
| RCA 1802||5,000||1974||6,505||-1,505||-23%&lt;br /&gt;
|-&lt;br /&gt;
| MOS Technology 6502||3,510||1975||9,200||-5,690||-62%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8085||6,500||1976||13,011||-6,511||-50%&lt;br /&gt;
|-&lt;br /&gt;
| Zilog Z80||8,500||1976||13,011||-4,511||-35%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 6809||9,000||1978||26,022||-17,022||-65%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8086||29,000||1978||26,022||2,978||11%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8088||29,000||1979||36,800||-7,800||-21%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 68000||68,000||1979||36,800||31,200||85%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80186||55,000||1982||104,086||-49,086||-47%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80286||134,000||1982||104,086||29,914||29%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80386||275,000||1985||294,400||-19,400||-7%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80486||1,180,000||1989||1,177,600||2,400||0%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium||3,100,000||1993||4,710,400||-1,610,400||-34%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K5||4,300,000||1996||13,323,023||-9,023,023||-68%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium II||7,500,000||1997||18,841,600||-11,341,600||-60%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6||8,800,000||1997||18,841,600||-10,041,600||-53%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium III||9,500,000||1999||37,683,200||-28,183,200||-75%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6-III||21,300,000||1999||37,683,200||-16,383,200||-43%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K7||22,000,000||1999||37,683,200||-15,683,200||-42%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium 4||42,000,000||2000||53,292,093||-11,292,093||-21%&lt;br /&gt;
|-&lt;br /&gt;
| Barton||54,300,000||2003||150,732,800||-96,432,800||-64%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K8||105,900,000||2003||150,732,800||-44,832,800||-30%&lt;br /&gt;
|-&lt;br /&gt;
| Itanium 2||220,000,000||2003||150,732,800||69,267,200||46%&lt;br /&gt;
|-&lt;br /&gt;
| Itanium 2 with 9MB cache||592,000,000||2004||213,168,370||378,831,630||178%&lt;br /&gt;
|-&lt;br /&gt;
| Cell||241,000,000||2006||426,336,740||-185,336,740||-43%&lt;br /&gt;
|-&lt;br /&gt;
| Core 2 Duo||291,000,000||2006||426,336,740||-135,336,740||-32%&lt;br /&gt;
|-&lt;br /&gt;
| Dual-Core Itanium 2||1,700,000,000||2006||426,336,740||1,273,663,260||299%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K10||463,000,000||2007||602,931,200||-139,931,200||-23%&lt;br /&gt;
|-&lt;br /&gt;
| POWER6||789,000,000||2007||602,931,200||186,068,800||31%&lt;br /&gt;
|-&lt;br /&gt;
| Atom||47,000,000||2008||852,673,480||-805,673,480||-94%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K10||758,000,000||2008||852,673,480||-94,673,480||-11%&lt;br /&gt;
|-&lt;br /&gt;
| Core i7 (Quad)||731,000,000||2008||852,673,480||-121,673,480||-14%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Xeon 7400||1,900,000,000||2008||852,673,480||1,047,326,520||123%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Opteron 2400||904,000,000||2009||1,205,862,400||-301,862,400||-25%&lt;br /&gt;
|-&lt;br /&gt;
| 16-Core SPARC T3||1,000,000,000||2010||1,705,346,960||-705,346,960||-41%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Core i7 (Gulftown)||1,170,000,000||2010||1,705,346,960||-535,346,960||-31%&lt;br /&gt;
|-&lt;br /&gt;
| 8-core POWER7||1,200,000,000||2010||1,705,346,960||-505,346,960||-30%&lt;br /&gt;
|-&lt;br /&gt;
| Quad-core z196||1,400,000,000||2010||1,705,346,960||-305,346,960||-18%&lt;br /&gt;
|-&lt;br /&gt;
| Quad-Core Itanium Tukwila||2,000,000,000||2010||1,705,346,960||294,653,040||17%&lt;br /&gt;
|-&lt;br /&gt;
| 8-Core Xeon Nehalem-EX||2,300,000,000||2010||1,705,346,960||594,653,040||35%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Core i7 (Sandy Bridge-E)||2,270,000,000||2011||2,411,724,800||-141,724,800||-6%&lt;br /&gt;
|-&lt;br /&gt;
| 10-Core Xeon Westmere-EX||2,600,000,000||2011||2,411,724,800||188,275,200||8%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Data Review ==&lt;br /&gt;
&lt;br /&gt;
So it's quite clear that the &amp;quot;law&amp;quot; is not to be taken too literally. It's a general marker for what to expect in upcoming years. In several years the count actually jumps to catch up with the prediction (by 2011, the variance is only 8% with the 10-Core Xeon chip). The Dual-Core Itanium 2 also lurches forward, to nearly 300% above the prediction for 2006.&lt;br /&gt;
&lt;br /&gt;
The following chart shows the actual and predicted transistor counts:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:TransistorCountVariance.JPG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Variance Years ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Some processors are probably exceptions that should be taken out of the data. The Atom, for example, is an ultra-low-voltage processor embedded in netbooks. Such special-purpose processors do not reflect the state of the technology at the time: the Intel Core i7 chip, released in 2008, the same year as the Atom, is much closer to the mark at a variance of only -14%, versus the Atom's -94%.&lt;br /&gt;
&lt;br /&gt;
So when evaluating whether Moore's law has failed to hold, it is important to compare chips of the same family, or perhaps to take the best technology for a given year. For example, the Motorola 68000 (1979), Intel 80286 (1982), Intel 80386 (1985) and Intel 80486 (1989), the best chips for those years in the data above, either exceed Moore's predictions or come very close.&lt;br /&gt;
&lt;br /&gt;
But there are many intervals where this is not true, including 1971-1978 and 1993-2000, in which most chips fall well short of the prediction. One explanation is that other chip factors started to become more important during this time (power consumption, for example). One also sees certain chips and chip families that cause transistor counts to lurch forward (for example, the Dual-Core Itanium 2, with 1.7 billion transistors, roughly 4 times Moore's prediction for 2006). This lurching forward may mean that, in subsequent years, the hardware outstrips market needs and it takes some time for operating systems and software to catch up. During this period, there is less demand for more powerful chips, so market forces favor the production of less expensive chips with lower transistor counts.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Moore's Law - The Future ==&lt;br /&gt;
&lt;br /&gt;
With a good survey of data and historic context firmly in view, we can now consider whether Moore's Law will hold in future years. Clearly, Moore's predictions concerned transistor counts, not processor speed. But like Knowles, Moore was searching for a simple metric of complexity and performance with respect to cost. His ultimate aim was to predict future performance at costs that were feasible in the marketplace. Hence one can take Moore's Law to apply to performance rather than to the specific metric of transistor counts.&lt;br /&gt;
&lt;br /&gt;
Performance is a function of many factors besides transistor counts, including memory management and overall processor architecture (pipelining, cache levels, etc.). Still, in the context of miniaturization and monolithic architectures, it is possible that Moore underestimated the importance of these factors despite his ultimate interest in the sustained future improvement in performance with respect to cost. We must also consider that, as with any new technology, the semiconductor industry was struggling to convince the engineering community and public of the importance of integrated circuits. The focus was deliberate and simple: adopt integrated circuits as the future.&lt;br /&gt;
&lt;br /&gt;
Those who argue that transistor counts will eventually hit a wall typically do so on the basis of physical limitations. The arguments seem reasonable given that Moore's prediction concerned the number of transistors on a single die and the general unsustainability of exponential growth. Even if the physical limitations were surmounted by better materials and semi-conductor manufacturing processes, it is likely their benefits will be outweighed by alternative innovations at other levels.&lt;br /&gt;
&lt;br /&gt;
The exponential growth predicted by Moore's law is ultimately not sustainable, even as a rough guideline. Even if it were, and the physical limitations above were overcome, the resulting transistor counts or performance improvements would likely outrun Moore's predictions significantly, since the technology to work with materials at an atomic level would likely lead to completely different architectures that Moore could never have predicted.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
*Brock, David C. (2006). &amp;quot;Understanding Moore's Law: four decades of innovation&amp;quot; ''Chemical Heritage Foundation''.&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/1b_ps&amp;diff=58210</id>
		<title>CSC/ECE 506 Spring 2012/1b ps</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/1b_ps&amp;diff=58210"/>
		<updated>2012-02-07T03:28:56Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Moore's Law */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Moore's Law ==&lt;br /&gt;
&lt;br /&gt;
In 1965, Intel co-founder Gordon Moore predicted that the number of [http://en.wikipedia.org/wiki/Transistor transistors] on a die would double at regular intervals, a period most often quoted today as every 24 months. This was a rough predictive statement that has brought Moore acclaim for its reputed accuracy and foresight. This article explores different interpretations of Moore's law, whether it has indeed held true across all significant intervals of time, and whether it will hold true in the future.&lt;br /&gt;
&lt;br /&gt;
The rate of growth Moore predicted is truly staggering. The equation form of the law is: T(t) = T0 * 2^(t/2), where T0 represents the initial transistor count in the start year and T(t) the transistor count t years later. Exponential growth is hard to appreciate in terms of transistors, so we'll switch to counting something more tangible (and one nearly everyone understands): money. If one could double her wealth every 24 months and started with $1, she would have $5,931,641 in 45 years!&lt;br /&gt;
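&lt;br /&gt;
As a minimal sketch of the equation above, the following Python snippet evaluates T(t) = T0 * 2^(t/2). The function name moore_prediction and its parameter names are illustrative assumptions and do not come from any source cited in this article.&lt;br /&gt;

```python
def moore_prediction(initial_count, years_elapsed, doubling_period=2.0):
    """Return the predicted count after years_elapsed years.

    Implements T(t) = T0 * 2^(t / doubling_period). The default
    doubling period of 2 years matches the 24-month form of the law.
    """
    return initial_count * 2 ** (years_elapsed / doubling_period)


# The wealth example from the text: $1 doubled every 24 months for 45 years.
# int() truncates the fractional dollar, as the figure in the text does.
wealth = moore_prediction(1, 45)
print(f'${int(wealth):,}')

# The Intel 4004 (2,300 transistors in 1971) projected forward to 2011.
print(f'{moore_prediction(2300, 2011 - 1971):,.0f}')
```

Passing doubling_period=1.0 instead gives the faster growth of a one-year doubling period, which makes it easy to see how strongly the result depends on that single parameter.&lt;br /&gt;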
&lt;br /&gt;
But this is exactly what Intel claims it has done with the transistor density of their processors (and certainly their bottom line). To a large extent, the company has symbiotically harnessed and fueled the public awareness of Moore's law to its advantage as a marketing device, but it is still valuable to study the context of the prediction (the 60s, the advent of semi-conductor technology and miniaturization) and the reasons Moore believed the prediction would hold and whether he was right.&lt;br /&gt;
&lt;br /&gt;
== Historic Context ==&lt;br /&gt;
&lt;br /&gt;
According to David C. Brock in ''Understanding Moore's Law: Four Decades of Innovation'':&lt;br /&gt;
&lt;br /&gt;
&amp;quot;By 1960 miniaturization was a fundamental issue for semiconductor technology and its industry. It had become, moreover, a central factor in the semiconductor community's discussions surrounding the new integrated circuits that had been touted in 1959 by Texas Instruments as the first realization of the &amp;quot;monolithic&amp;quot; circuit ideal.&amp;quot;&amp;lt;ref&amp;gt;Brock, David C. ''Understanding Moore's Law: Four Decades of Innovation''. Chemical Heritage Foundation, 2006, p. 26.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Two ideas are most striking about this statement: the focus on miniaturization and the concept of the &amp;quot;monolithic&amp;quot; circuit ideal.&lt;br /&gt;
&lt;br /&gt;
In the early 1960s, the semiconductor industry was gravitating towards building [http://en.wikipedia.org/wiki/Integrated_circuit integrated circuits] on wafers of silicon rather than using discrete transistors as components in electronic devices. The arguments presented were largely based on cost reduction: integrated, or monolithic, circuits were cheaper to produce than devices built from connected discrete transistors.&lt;br /&gt;
&lt;br /&gt;
Among the seminal presentations of the time, predating Moore's publication of Moore's Law, was that of C. Harry Knowles (manager of Westinghouse's molecular electronics division). Knowles addressed two critical ideas that help put Moore's Law into proper context.&lt;br /&gt;
First, Knowles argued that with technological progress, the industry could produce integrated circuits with greater functionality and complexity at higher [http://en.wikipedia.org/wiki/Yield yields] (a measure of output) (Brock). Second, Knowles brought attention to the issue of performance as a function of size.&lt;br /&gt;
&lt;br /&gt;
== Transistor Counts Over The Years ==&lt;br /&gt;
&lt;br /&gt;
The stage was set: smaller was faster, and miniaturized, integrated circuits were cheaper. In his presentation of Moore's Law, Moore built on Knowles's ideas in several ways, including a significant simplification of the discussion to transistor counts over time.&lt;br /&gt;
&lt;br /&gt;
If Moore was right, he wasn't right on the first try. His initial prediction was that transistor counts would double every year. In 1975 he revised this prediction to every two years.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://news.cnet.com/Myths-of-Moores-Law/2010-1071_3-1014887.html &amp;quot;Kanellos, Michael. Perspective: Myths of Moore's Law&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If we take 1971 and the Intel 4004 processor as our starting point &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://en.wikipedia.org/wiki/Transistor_count &amp;quot;Microprocessors&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;, the following table shows each processor's actual transistor count alongside the predicted count, the variance (actual minus predicted) and the percent error.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Processor'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Transistor count'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Year'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Predicted Value'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Variance'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Percent Error (Variance/Predicted Transistor Count)'''&lt;br /&gt;
|-&lt;br /&gt;
| Intel 4004||2,300||1971||2,300||N/A||&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8008||3,500||1972||3,253||247||8%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 6800||4,100||1974||6,505||-2,405||-37%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8080||4,500||1974||6,505||-2,005||-31%&lt;br /&gt;
|-&lt;br /&gt;
| RCA 1802||5,000||1974||6,505||-1,505||-23%&lt;br /&gt;
|-&lt;br /&gt;
| MOS Technology 6502||3,510||1975||9,200||-5,690||-62%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8085||6,500||1976||13,011||-6,511||-50%&lt;br /&gt;
|-&lt;br /&gt;
| Zilog Z80||8,500||1976||13,011||-4,511||-35%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 6809||9,000||1978||26,022||-17,022||-65%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8086||29,000||1978||26,022||2,978||11%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8088||29,000||1979||36,800||-7,800||-21%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 68000||68,000||1979||36,800||31,200||85%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80186||55,000||1982||104,086||-49,086||-47%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80286||134,000||1982||104,086||29,914||29%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80386||275,000||1985||294,400||-19,400||-7%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80486||1,180,000||1989||1,177,600||2,400||0%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium||3,100,000||1993||4,710,400||-1,610,400||-34%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K5||4,300,000||1996||13,323,023||-9,023,023||-68%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium II||7,500,000||1997||18,841,600||-11,341,600||-60%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6||8,800,000||1997||18,841,600||-10,041,600||-53%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium III||9,500,000||1999||37,683,200||-28,183,200||-75%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6-III||21,300,000||1999||37,683,200||-16,383,200||-43%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K7||22,000,000||1999||37,683,200||-15,683,200||-42%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium 4||42,000,000||2000||53,292,093||-11,292,093||-21%&lt;br /&gt;
|-&lt;br /&gt;
| Barton||54,300,000||2003||150,732,800||-96,432,800||-64%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K8||105,900,000||2003||150,732,800||-44,832,800||-30%&lt;br /&gt;
|-&lt;br /&gt;
| Itanium 2||220,000,000||2003||150,732,800||69,267,200||46%&lt;br /&gt;
|-&lt;br /&gt;
| Itanium 2 with 9MB cache||592,000,000||2004||213,168,370||378,831,630||178%&lt;br /&gt;
|-&lt;br /&gt;
| Cell||241,000,000||2006||426,336,740||-185,336,740||-43%&lt;br /&gt;
|-&lt;br /&gt;
| Core 2 Duo||291,000,000||2006||426,336,740||-135,336,740||-32%&lt;br /&gt;
|-&lt;br /&gt;
| Dual-Core Itanium 2||1,700,000,000||2006||426,336,740||1,273,663,260||299%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K10||463,000,000||2007||602,931,200||-139,931,200||-23%&lt;br /&gt;
|-&lt;br /&gt;
| POWER6||789,000,000||2007||602,931,200||186,068,800||31%&lt;br /&gt;
|-&lt;br /&gt;
| Atom||47,000,000||2008||852,673,480||-805,673,480||-94%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K10||758,000,000||2008||852,673,480||-94,673,480||-11%&lt;br /&gt;
|-&lt;br /&gt;
| Core i7 (Quad)||731,000,000||2008||852,673,480||-121,673,480||-14%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Xeon 7400||1,900,000,000||2008||852,673,480||1,047,326,520||123%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Opteron 2400||904,000,000||2009||1,205,862,400||-301,862,400||-25%&lt;br /&gt;
|-&lt;br /&gt;
| 16-Core SPARC T3||1,000,000,000||2010||1,705,346,960||-705,346,960||-41%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Core i7 (Gulftown)||1,170,000,000||2010||1,705,346,960||-535,346,960||-31%&lt;br /&gt;
|-&lt;br /&gt;
| 8-core POWER7||1,200,000,000||2010||1,705,346,960||-505,346,960||-30%&lt;br /&gt;
|-&lt;br /&gt;
| Quad-core z196||1,400,000,000||2010||1,705,346,960||-305,346,960||-18%&lt;br /&gt;
|-&lt;br /&gt;
| Quad-Core Itanium Tukwila||2,000,000,000||2010||1,705,346,960||294,653,040||17%&lt;br /&gt;
|-&lt;br /&gt;
| 8-Core Xeon Nehalem-EX||2,300,000,000||2010||1,705,346,960||594,653,040||35%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Core i7 (Sandy Bridge-E)||2,270,000,000||2011||2,411,724,800||-141,724,800||-6%&lt;br /&gt;
|-&lt;br /&gt;
| 10-Core Xeon Westmere-EX||2,600,000,000||2011||2,411,724,800||188,275,200||8%&lt;br /&gt;
|}&lt;br /&gt;
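&lt;br /&gt;
The derived columns of the table above can be recomputed from the processor's year and actual transistor count. The following Python sketch assumes the same starting point as the table (the Intel 4004 with 2,300 transistors in 1971) and a two-year doubling period. The names predicted_count and variance_row are hypothetical helpers introduced only for illustration.&lt;br /&gt;

```python
BASE_YEAR = 1971
BASE_COUNT = 2300  # Intel 4004


def predicted_count(year):
    """Predicted transistor count for a year, doubling every two years."""
    return BASE_COUNT * 2 ** ((year - BASE_YEAR) / 2)


def variance_row(actual, year):
    """Return (predicted, variance, percent_error), rounded as in the table."""
    predicted = predicted_count(year)
    variance = actual - predicted
    percent_error = 100 * variance / predicted
    return round(predicted), round(variance), round(percent_error)


# A few rows copied from the table above: (processor, actual count, year).
sample = [
    ('Intel 8008', 3500, 1972),
    ('Intel 80486', 1180000, 1989),
    ('Atom', 47000000, 2008),
    ('10-Core Xeon Westmere-EX', 2600000000, 2011),
]

for name, actual, year in sample:
    predicted, variance, percent = variance_row(actual, year)
    print(f'{name}: predicted {predicted:,}, variance {variance:,}, error {percent}%')
```

Variance here means the actual count minus the predicted count, so a negative value marks a chip that fell short of the prediction.&lt;br /&gt;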
&lt;br /&gt;
== Data Review ==&lt;br /&gt;
&lt;br /&gt;
So it's quite clear that the &amp;quot;law&amp;quot; is not to be taken too literally. It is a general marker for what to expect in upcoming years. One actually sees that the count jumps in several years to catch up (by 2011, the variance is only 8% with the 10-Core Xeon chip). The Dual-Core Itanium 2 also lurches ahead of the prediction by nearly 300%.&lt;br /&gt;
&lt;br /&gt;
The following chart shows the actual and predicted transistor counts:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:TransistorCountVariance.JPG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Variance Years ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We see that there are probably exceptions which should be taken out of the data; for example, the Atom processor, which is an ultra-low-voltage processor embedded in netbooks. These are special-purpose processors that do not reflect the state of the technology at the time (for example, the Intel Core i7 chip in 2008, the same year as the Atom chip, is closer to the mark at a variance of only -14%, versus the Atom's -94%).&lt;br /&gt;
&lt;br /&gt;
So when evaluating whether Moore's law has failed to hold, it is important to compare chips of the same family, or perhaps to take the best technology for a given year. For example, the best chips in the data above for 1979, 1982, 1985 and 1989 (the Motorola 68000, Intel 80286, Intel 80386 and Intel 80486, respectively) either exceed Moore's predictions or come very close.&lt;br /&gt;
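&lt;br /&gt;
One way to sketch the idea of taking the best technology for a given year is to keep only the highest transistor count per year before comparing against the prediction. The Python snippet below uses a handful of rows from the table above. The helper names best_per_year and percent_error are assumptions made for this illustration, and the 1971 baseline of 2,300 transistors is the same one the table uses.&lt;br /&gt;

```python
# (processor, transistor count, year) rows taken from the table above.
rows = [
    ('Intel 8088', 29000, 1979),
    ('Motorola 68000', 68000, 1979),
    ('Intel 80186', 55000, 1982),
    ('Intel 80286', 134000, 1982),
    ('Atom', 47000000, 2008),
    ('Core i7 (Quad)', 731000000, 2008),
    ('Six-Core Xeon 7400', 1900000000, 2008),
]


def best_per_year(rows):
    """Keep only the highest-count processor for each year."""
    best = {}
    for name, count, year in rows:
        current = best.get(year, (name, count))
        best[year] = max(current, (name, count), key=lambda item: item[1])
    return best


def percent_error(count, year, base_count=2300, base_year=1971):
    """Percent difference between an actual count and the prediction."""
    predicted = base_count * 2 ** ((year - base_year) / 2)
    return round(100 * (count - predicted) / predicted)


for year, (name, count) in sorted(best_per_year(rows).items()):
    print(f'{year}: {name} ({percent_error(count, year)}%)')
```

Filtering this way removes low-power outliers such as the Atom without having to decide by hand which chips count as special purpose.&lt;br /&gt;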
&lt;br /&gt;
But there are many intervals where this is not true, including 1971-1978 and 1993-2000, when actual transistor counts fell well short of the prediction. One explanation is that other chip factors started to become more important during this time (power consumption, for example). One also sees certain chips and chip families which cause transistor counts to lurch forward (for example, the Dual-Core Itanium 2, which included 1.7 billion transistors, about 4 times Moore's prediction for 2006). This lurching forward may mean that, for subsequent years, the hardware outstrips market needs and it takes some time for operating systems and software to catch up. During this period, there is less demand for more powerful chips, so market forces favor the production of less expensive chips with lower transistor counts.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Moore's Law - The Future ==&lt;br /&gt;
&lt;br /&gt;
With a good survey of data and historic context firmly in view, we can now consider whether Moore's Law will hold in future years. Clearly, Moore's predictions concerned transistor counts, not processor speed. But like Knowles, Moore was searching for a simple metric of complexity and performance with respect to cost. His ultimate aim was to predict future performance at costs that were feasible in the marketplace. Hence, on one reading, Moore's Law applies to performance rather than to the specific metric of transistor counts.&lt;br /&gt;
&lt;br /&gt;
Performance is a function of many factors besides transistor counts, including memory management and overall processor architecture (pipelining, cache levels, etc.). Still, in the context of miniaturization and monolithic architectures, it is possible that Moore underestimated the importance of these factors despite his ultimate interest in the sustained future improvement in performance with respect to cost. We must also consider that, as with any new technology, the semiconductor industry was struggling to convince the engineering community and public of the importance of integrated circuits. The focus was deliberate and simple: adopt integrated circuits as the future.&lt;br /&gt;
&lt;br /&gt;
Those who argue that transistor counts will eventually hit a wall typically do so on the basis of physical limitations. The arguments seem reasonable given that Moore's prediction concerned the number of transistors on a single die and the general unsustainability of exponential growth. Even if the physical limitations were surmounted by better materials and semi-conductor manufacturing processes, it is likely their benefits will be outweighed by alternative innovations at other levels.&lt;br /&gt;
&lt;br /&gt;
The exponential growth predicted by Moore's law is ultimately not sustainable, even as a rough guideline. Even if it were, and the physical limitations above were overcome, the resulting transistor counts or performance improvements would likely outrun Moore's predictions significantly, since the technology to work with materials at an atomic level would likely lead to completely different architectures that Moore could never have predicted.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
*Brock, David C. (2006). &amp;quot;Understanding Moore's Law: four decades of innovation&amp;quot; ''Chemical Heritage Foundation''.&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/1b_ps&amp;diff=58208</id>
		<title>CSC/ECE 506 Spring 2012/1b ps</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/1b_ps&amp;diff=58208"/>
		<updated>2012-02-07T03:25:57Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Transistor Counts Over The Years */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Moore's Law ==&lt;br /&gt;
&lt;br /&gt;
In 1965, Intel co-founder Gordon Moore predicted that the number of [http://en.wikipedia.org/wiki/Transistor transistors] on a die would double at regular intervals, a period most often quoted today as every 24 months. This was a rough predictive statement that has brought Moore acclaim for its reputed accuracy and foresight. This article explores different interpretations of Moore's law, whether it has indeed held true across all significant intervals of time, and whether it will hold true in the future.&lt;br /&gt;
&lt;br /&gt;
The rate of growth Moore predicted is truly staggering. The equation form of the law is: T(t) = T0 * 2^(t/2), where T0 represents the initial transistor count in the start year and T(t) the transistor count t years later. Exponential growth is hard to appreciate in terms of transistors, so we'll switch to counting something more tangible (and one nearly everyone understands): money. If one could double her wealth every 24 months and started with $1, she would have $5,931,641 in 45 years!&lt;br /&gt;
&lt;br /&gt;
But this is exactly what Intel claims it has done with the transistor density of their processors (and certainly their bottom line). To a large extent, the company has symbiotically harnessed and fueled the public awareness of Moore's law to its advantage as a marketing device, but it is still valuable to study the context of the prediction (the 60s, the advent of semi-conductor technology and miniaturization) and the reasons Moore believed the prediction would hold and whether he was right.&lt;br /&gt;
&lt;br /&gt;
== Historic Context ==&lt;br /&gt;
&lt;br /&gt;
According to David C. Brock in ''Understanding Moore's Law: Four Decades of Innovation'':&lt;br /&gt;
&lt;br /&gt;
&amp;quot;By 1960 miniaturization was a fundamental issue for semiconductor technology and its industry. It had become, moreover, a central factor in the semiconductor community's discussions surrounding the new integrated circuits that had been touted in 1959 by Texas Instruments as the first realization of the &amp;quot;monolithic&amp;quot; circuit ideal.&amp;quot;&amp;lt;ref&amp;gt;Brock, David C. ''Understanding Moore's Law: Four Decades of Innovation''. Chemical Heritage Foundation, 2006, p. 26.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Two ideas are most striking about this statement: the focus on miniaturization and the concept of the &amp;quot;monolithic&amp;quot; circuit ideal.&lt;br /&gt;
&lt;br /&gt;
In the early 1960s, the semiconductor industry was gravitating towards building [http://en.wikipedia.org/wiki/Integrated_circuit integrated circuits] on wafers of silicon rather than using discrete transistors as components in electronic devices. The arguments presented were largely based on cost reduction: integrated, or monolithic, circuits were cheaper to produce than devices built from connected discrete transistors.&lt;br /&gt;
&lt;br /&gt;
Among the seminal presentations of the time, predating Moore's publication of Moore's Law, was that of C. Harry Knowles (manager of Westinghouse's molecular electronics division). Knowles addressed two critical ideas that help put Moore's Law into proper context.&lt;br /&gt;
First, Knowles argued that with technological progress, the industry could produce integrated circuits with greater functionality and complexity at higher [http://en.wikipedia.org/wiki/Yield yields] (a measure of output) (Brock). Second, Knowles brought attention to the issue of performance as a function of size.&lt;br /&gt;
&lt;br /&gt;
== Transistor Counts Over The Years ==&lt;br /&gt;
&lt;br /&gt;
The stage was set: smaller was faster, and miniaturized, integrated circuits were cheaper. In his presentation of Moore's Law, Moore built on Knowles's ideas in several ways, including a significant simplification of the discussion to transistor counts over time.&lt;br /&gt;
&lt;br /&gt;
If Moore was right, he wasn't right on the first try. His initial prediction was that transistor counts would double every year. In 1975 he revised this prediction to every two years.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://news.cnet.com/Myths-of-Moores-Law/2010-1071_3-1014887.html &amp;quot;Kanellos, Michael. Perspective: Myths of Moore's Law&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If we take 1971 and the Intel 4004 processor as our starting point &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://en.wikipedia.org/wiki/Transistor_count &amp;quot;Microprocessors&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;, the following table shows each processor's actual transistor count alongside the predicted count, the variance (actual minus predicted) and the percent error.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Processor'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Transistor count'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Year'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Predicted Value'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Variance'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Percent Error (Variance/Predicted Transistor Count)'''&lt;br /&gt;
|-&lt;br /&gt;
| Intel 4004||2,300||1971||2,300||N/A||&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8008||3,500||1972||3,253||247||8%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 6800||4,100||1974||6,505||-2,405||-37%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8080||4,500||1974||6,505||-2,005||-31%&lt;br /&gt;
|-&lt;br /&gt;
| RCA 1802||5,000||1974||6,505||-1,505||-23%&lt;br /&gt;
|-&lt;br /&gt;
| MOS Technology 6502||3,510||1975||9,200||-5,690||-62%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8085||6,500||1976||13,011||-6,511||-50%&lt;br /&gt;
|-&lt;br /&gt;
| Zilog Z80||8,500||1976||13,011||-4,511||-35%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 6809||9,000||1978||26,022||-17,022||-65%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8086||29,000||1978||26,022||2,978||11%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8088||29,000||1979||36,800||-7,800||-21%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 68000||68,000||1979||36,800||31,200||85%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80186||55,000||1982||104,086||-49,086||-47%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80286||134,000||1982||104,086||29,914||29%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80386||275,000||1985||294,400||-19,400||-7%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80486||1,180,000||1989||1,177,600||2,400||0%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium||3,100,000||1993||4,710,400||-1,610,400||-34%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K5||4,300,000||1996||13,323,023||-9,023,023||-68%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium II||7,500,000||1997||18,841,600||-11,341,600||-60%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6||8,800,000||1997||18,841,600||-10,041,600||-53%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium III||9,500,000||1999||37,683,200||-28,183,200||-75%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6-III||21,300,000||1999||37,683,200||-16,383,200||-43%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K7||22,000,000||1999||37,683,200||-15,683,200||-42%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium 4||42,000,000||2000||53,292,093||-11,292,093||-21%&lt;br /&gt;
|-&lt;br /&gt;
| Barton||54,300,000||2003||150,732,800||-96,432,800||-64%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K8||105,900,000||2003||150,732,800||-44,832,800||-30%&lt;br /&gt;
|-&lt;br /&gt;
| Itanium 2||220,000,000||2003||150,732,800||69,267,200||46%&lt;br /&gt;
|-&lt;br /&gt;
| Itanium 2 with 9MB cache||592,000,000||2004||213,168,370||378,831,630||178%&lt;br /&gt;
|-&lt;br /&gt;
| Cell||241,000,000||2006||426,336,740||-185,336,740||-43%&lt;br /&gt;
|-&lt;br /&gt;
| Core 2 Duo||291,000,000||2006||426,336,740||-135,336,740||-32%&lt;br /&gt;
|-&lt;br /&gt;
| Dual-Core Itanium 2||1,700,000,000||2006||426,336,740||1,273,663,260||299%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K10||463,000,000||2007||602,931,200||-139,931,200||-23%&lt;br /&gt;
|-&lt;br /&gt;
| POWER6||789,000,000||2007||602,931,200||186,068,800||31%&lt;br /&gt;
|-&lt;br /&gt;
| Atom||47,000,000||2008||852,673,480||-805,673,480||-94%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K10||758,000,000||2008||852,673,480||-94,673,480||-11%&lt;br /&gt;
|-&lt;br /&gt;
| Core i7 (Quad)||731,000,000||2008||852,673,480||-121,673,480||-14%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Xeon 7400||1,900,000,000||2008||852,673,480||1,047,326,520||123%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Opteron 2400||904,000,000||2009||1,205,862,400||-301,862,400||-25%&lt;br /&gt;
|-&lt;br /&gt;
| 16-Core SPARC T3||1,000,000,000||2010||1,705,346,960||-705,346,960||-41%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Core i7 (Gulftown)||1,170,000,000||2010||1,705,346,960||-535,346,960||-31%&lt;br /&gt;
|-&lt;br /&gt;
| 8-core POWER7||1,200,000,000||2010||1,705,346,960||-505,346,960||-30%&lt;br /&gt;
|-&lt;br /&gt;
| Quad-core z196||1,400,000,000||2010||1,705,346,960||-305,346,960||-18%&lt;br /&gt;
|-&lt;br /&gt;
| Quad-Core Itanium Tukwila||2,000,000,000||2010||1,705,346,960||294,653,040||17%&lt;br /&gt;
|-&lt;br /&gt;
| 8-Core Xeon Nehalem-EX||2,300,000,000||2010||1,705,346,960||594,653,040||35%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Core i7 (Sandy Bridge-E)||2,270,000,000||2011||2,411,724,800||-141,724,800||-6%&lt;br /&gt;
|-&lt;br /&gt;
| 10-Core Xeon Westmere-EX||2,600,000,000||2011||2,411,724,800||188,275,200||8%&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Data Review ==&lt;br /&gt;
&lt;br /&gt;
So it's quite clear that the &amp;quot;law&amp;quot; is not to be taken too literally. It is a general marker for what to expect in upcoming years. One actually sees that the count jumps in several years to catch up (by 2011, the variance is only 8% with the 10-Core Xeon chip). The Dual-Core Itanium 2 also lurches ahead of the prediction by nearly 300%.&lt;br /&gt;
&lt;br /&gt;
The following chart shows the actual and predicted transistor counts:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:TransistorCountVariance.JPG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Variance Years ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We see that there are probably exceptions which should be taken out of the data; for example, the Atom processor, which is an ultra-low-voltage processor embedded in netbooks. These are special-purpose processors that do not reflect the state of the technology at the time (for example, the Intel Core i7 chip in 2008, the same year as the Atom chip, is closer to the mark at a variance of only -14%, versus the Atom's -94%).&lt;br /&gt;
&lt;br /&gt;
So when evaluating whether Moore's law has failed to hold, it is important to compare chips of the same family, or perhaps to take the best technology for a given year. For example, the best chips in the data above for 1979, 1982, 1985 and 1989 (the Motorola 68000, Intel 80286, Intel 80386 and Intel 80486, respectively) either exceed Moore's predictions or come very close.&lt;br /&gt;
&lt;br /&gt;
But there are many intervals where this is not true, including 1971-1978 and 1993-2000, when actual transistor counts fell well short of the prediction. One explanation is that other chip factors started to become more important during this time (power consumption, for example). One also sees certain chips and chip families which cause transistor counts to lurch forward (for example, the Dual-Core Itanium 2, which included 1.7 billion transistors, about 4 times Moore's prediction for 2006). This lurching forward may mean that, for subsequent years, the hardware outstrips market needs and it takes some time for operating systems and software to catch up. During this period, there is less demand for more powerful chips, so market forces favor the production of less expensive chips with lower transistor counts.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Moore's Law - The Future ==&lt;br /&gt;
&lt;br /&gt;
With a good survey of data and historic context firmly in view, we can now consider whether Moore's Law will hold in future years. Clearly, Moore's predictions concerned transistor counts, not processor speed. But like Knowles, Moore was searching for a simple metric of complexity and performance with respect to cost. His ultimate aim was to predict future performance at costs that were feasible in the marketplace. Hence, on one reading, Moore's Law applies to performance rather than to the specific metric of transistor counts.&lt;br /&gt;
&lt;br /&gt;
Performance is a function of many factors besides transistor counts, including memory management and overall processor architecture (pipelining, cache levels, etc.). Still, in the context of miniaturization and monolithic architectures, it is possible that Moore underestimated the importance of these factors despite his ultimate interest in the sustained future improvement in performance with respect to cost. We must also consider that, as with any new technology, the semiconductor industry was struggling to convince the engineering community and public of the importance of integrated circuits. The focus was deliberate and simple: adopt integrated circuits as the future.&lt;br /&gt;
&lt;br /&gt;
Those who argue that transistor counts will eventually hit a wall typically do so on the basis of physical limitations. The arguments seem reasonable given that Moore's prediction concerned the number of transistors on a single die and the general unsustainability of exponential growth. Even if the physical limitations were surmounted by better materials and semi-conductor manufacturing processes, it is likely their benefits will be outweighed by alternative innovations at other levels.&lt;br /&gt;
&lt;br /&gt;
The exponential growth predicted by Moore's law is ultimately not sustainable, even as a rough guideline. Even if it were, and the physical limitations above were overcome, the resulting transistor counts or performance improvements would likely outrun Moore's predictions significantly, since the technology to work with materials at an atomic level would likely lead to completely different architectures that Moore could never have predicted.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
*Brock, David C. (2006). &amp;quot;Understanding Moore's Law: four decades of innovation&amp;quot; ''Chemical Heritage Foundation''.&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/1b_ps&amp;diff=58206</id>
		<title>CSC/ECE 506 Spring 2012/1b ps</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/1b_ps&amp;diff=58206"/>
		<updated>2012-02-07T03:25:37Z</updated>

		<summary type="html">&lt;p&gt;Psamoue: /* Transistor Counts Over The Years */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Moore's Law ==&lt;br /&gt;
&lt;br /&gt;
In 1965, Intel co-founder Gordon Moore predicted that the number of [http://en.wikipedia.org/wiki/Transistor transistors] on a die would double at regular intervals, a period most often quoted today as every 24 months. This was a rough predictive statement that has brought Moore acclaim for its reputed accuracy and foresight. This article explores different interpretations of Moore's law, whether it has indeed held true across all significant intervals of time, and whether it will hold true in the future.&lt;br /&gt;
&lt;br /&gt;
The rate of growth Moore predicted is truly staggering. The equation form of the law is: T(t) = T0 * 2^(t/2), where T0 represents the initial transistor count in the start year and T(t) the transistor count after t years. Exponential growth is hard to appreciate in terms of transistors, so we'll switch to counting something more tangible (and something nearly everyone understands): money. If one could double her wealth every 24 months and started with $1, she would have $5,931,641 in 45 years!&lt;br /&gt;
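&lt;br /&gt;
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name predicted_count and its parameters are illustrative choices, not part of any library, and the dollar figure is truncated to whole dollars as in the text.&lt;br /&gt;

```python
def predicted_count(initial, years, doubling_period=2.0):
    """Return initial * 2 ** (years / doubling_period), i.e. T(t) = T0 * 2^(t/2)."""
    return initial * 2 ** (years / doubling_period)


# Wealth example from the text: $1 doubling every 24 months for 45 years.
wealth = predicted_count(1, 45)
print(f"${int(wealth):,}")  # whole dollars, truncated

# Transistor example: Intel 4004 baseline (2,300 transistors in 1971),
# projected forward to 1979.
print(round(predicted_count(2300, 1979 - 1971)))
```

Passing a doubling_period of 1.0 instead gives the growth implied by a doubling every year.&lt;br /&gt;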
&lt;br /&gt;
But this is exactly what Intel claims it has done with the transistor density of its processors (and certainly its bottom line). To a large extent, the company has symbiotically harnessed and fueled the public awareness of Moore's law to its advantage as a marketing device, but it is still valuable to study the context of the prediction (the 60s, the advent of semiconductor technology and miniaturization), the reasons Moore believed the prediction would hold, and whether he was right.&lt;br /&gt;
&lt;br /&gt;
== Historic Context ==&lt;br /&gt;
&lt;br /&gt;
According to David C. Brock in Understanding Moore's Law: four decades of innovation:&lt;br /&gt;
&lt;br /&gt;
&amp;quot;By 1960 miniaturization was a fundamental issue for semiconductor technology and its industry. It had become, moreover, a central factor in the semiconductor community's discussions surrounding the new integrated circuits that had been touted in 1959 by Texas Instruments as the first realization of the &amp;quot;monolithic&amp;quot; circuit ideal.&amp;quot;&amp;lt;ref&amp;gt;Brock, David C. ''Understanding Moore's Law: four decades of innovation''. Chemical Heritage Foundation, 2006, p. 26.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Two ideas are most striking about this statement: the focus on miniaturization and the concept of the &amp;quot;monolithic&amp;quot; circuit ideal.&lt;br /&gt;
&lt;br /&gt;
In the early 1960s, the semiconductor industry was gravitating towards building [http://en.wikipedia.org/wiki/Integrated_circuit integrated circuits] on wafers of silicon rather than using discrete transistors as components in electronic devices. The arguments presented were largely based on cost reduction: integrated, or monolithic, circuits were cheaper to produce than devices based on connected discrete transistors.&lt;br /&gt;
&lt;br /&gt;
Among the seminal presentations of the time predating Moore's publication of Moore's Law is that of C. Harry Knowles (manager for Westinghouse's molecular electronics division). Knowles addressed two critical ideas that help put Moore's Law into proper context. First, Knowles argued that, with technological progress, integrated circuits could be produced with greater functionality and complexity at higher [http://en.wikipedia.org/wiki/Yield yields] (a measure of output) (Brock). Second, Knowles brought attention to the issue of performance as a function of size.&lt;br /&gt;
&lt;br /&gt;
== Transistor Counts Over The Years ==&lt;br /&gt;
&lt;br /&gt;
The stage was set: smaller was faster, and miniaturized, integrated circuits were cheaper. In his presentation of Moore's Law, Moore added several contributions to Knowles's ideas, including a significant simplification of those ideas to transistor counts over time.&lt;br /&gt;
&lt;br /&gt;
If Moore was right, he wasn't right on the first try. His initial prediction was that transistor counts would double every year. In 1975 he revised this prediction to a doubling every two years.&amp;lt;ref&amp;gt;&lt;br /&gt;
[http://news.cnet.com/Myths-of-Moores-Law/2010-1071_3-1014887.html &amp;quot;Kanellos, Michael. Perspective: Myths of Moore's Law&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If we take 1971 and the Intel 4004 processor as our starting point &amp;lt;ref&amp;gt;&lt;br /&gt;
[http://en.wikipedia.org/wiki/Transistor_count &amp;quot;Microprocessors&amp;quot;]&lt;br /&gt;
&amp;lt;/ref&amp;gt;, the table below lists each processor's actual transistor count alongside the predicted count, the variance (actual minus predicted), and the variance as a percentage of the predicted count:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| {{table}}&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Processor'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Transistor count'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Year'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Predicted Value'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Variance'''&lt;br /&gt;
| align=&amp;quot;center&amp;quot; style=&amp;quot;background:#f0f0f0;&amp;quot;|'''Percent Error (Variance/Predicted Transistor Count)'''&lt;br /&gt;
|-&lt;br /&gt;
| Intel 4004||2,300||1971||2,300||N/A||N/A&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8008||3,500||1972||3,253||247||8%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 6800||4,100||1974||6,505||-2,405||-37%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8080||4,500||1974||6,505||-2,005||-31%&lt;br /&gt;
|-&lt;br /&gt;
| RCA 1802||5,000||1974||6,505||-1,505||-23%&lt;br /&gt;
|-&lt;br /&gt;
| MOS Technology 6502||3,510||1975||9,200||-5,690||-62%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8085||6,500||1976||13,011||-6,511||-50%&lt;br /&gt;
|-&lt;br /&gt;
| Zilog Z80||8,500||1976||13,011||-4,511||-35%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 6809||9,000||1978||26,022||-17,022||-65%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8086||29,000||1978||26,022||2,978||11%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 8088||29,000||1979||36,800||-7,800||-21%&lt;br /&gt;
|-&lt;br /&gt;
| Motorola 68000||68,000||1979||36,800||31,200||85%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80186||55,000||1982||104,086||-49,086||-47%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80286||134,000||1982||104,086||29,914||29%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80386||275,000||1985||294,400||-19,400||-7%&lt;br /&gt;
|-&lt;br /&gt;
| Intel 80486||1,180,000||1989||1,177,600||2,400||0%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium||3,100,000||1993||4,710,400||-1,610,400||-34%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K5||4,300,000||1996||13,323,023||-9,023,023||-68%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium II||7,500,000||1997||18,841,600||-11,341,600||-60%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6||8,800,000||1997||18,841,600||-10,041,600||-53%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium III||9,500,000||1999||37,683,200||-28,183,200||-75%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K6-III||21,300,000||1999||37,683,200||-16,383,200||-43%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K7||22,000,000||1999||37,683,200||-15,683,200||-42%&lt;br /&gt;
|-&lt;br /&gt;
| Pentium 4||42,000,000||2000||53,292,093||-11,292,093||-21%&lt;br /&gt;
|-&lt;br /&gt;
| Barton||54,300,000||2003||150,732,800||-96,432,800||-64%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K8||105,900,000||2003||150,732,800||-44,832,800||-30%&lt;br /&gt;
|-&lt;br /&gt;
| Itanium 2||220,000,000||2003||150,732,800||69,267,200||46%&lt;br /&gt;
|-&lt;br /&gt;
| Itanium 2 with 9MB cache||592,000,000||2004||213,168,370||378,831,630||178%&lt;br /&gt;
|-&lt;br /&gt;
| Cell||241,000,000||2006||426,336,740||-185,336,740||-43%&lt;br /&gt;
|-&lt;br /&gt;
| Core 2 Duo||291,000,000||2006||426,336,740||-135,336,740||-32%&lt;br /&gt;
|-&lt;br /&gt;
| Dual-Core Itanium 2||1,700,000,000||2006||426,336,740||1,273,663,260||299%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K10||463,000,000||2007||602,931,200||-139,931,200||-23%&lt;br /&gt;
|-&lt;br /&gt;
| POWER6||789,000,000||2007||602,931,200||186,068,800||31%&lt;br /&gt;
|-&lt;br /&gt;
| Atom||47,000,000||2008||852,673,480||-805,673,480||-94%&lt;br /&gt;
|-&lt;br /&gt;
| AMD K10||758,000,000||2008||852,673,480||-94,673,480||-11%&lt;br /&gt;
|-&lt;br /&gt;
| Core i7 (Quad)||731,000,000||2008||852,673,480||-121,673,480||-14%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Xeon 7400||1,900,000,000||2008||852,673,480||1,047,326,520||123%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Opteron 2400||904,000,000||2009||1,205,862,400||-301,862,400||-25%&lt;br /&gt;
|-&lt;br /&gt;
| 16-Core SPARC T3||1,000,000,000||2010||1,705,346,960||-705,346,960||-41%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Core i7 (Gulftown)||1,170,000,000||2010||1,705,346,960||-535,346,960||-31%&lt;br /&gt;
|-&lt;br /&gt;
| 8-core POWER7||1,200,000,000||2010||1,705,346,960||-505,346,960||-30%&lt;br /&gt;
|-&lt;br /&gt;
| Quad-core z196||1,400,000,000||2010||1,705,346,960||-305,346,960||-18%&lt;br /&gt;
|-&lt;br /&gt;
| Quad-Core Itanium Tukwila||2,000,000,000||2010||1,705,346,960||294,653,040||17%&lt;br /&gt;
|-&lt;br /&gt;
| 8-Core Xeon Nehalem-EX||2,300,000,000||2010||1,705,346,960||594,653,040||35%&lt;br /&gt;
|-&lt;br /&gt;
| Six-Core Core i7 (Sandy Bridge-E)||2,270,000,000||2011||2,411,724,800||-141,724,800||-6%&lt;br /&gt;
|-&lt;br /&gt;
| 10-Core Xeon Westmere-EX||2,600,000,000||2011||2,411,724,800||188,275,200||8%&lt;br /&gt;
|}&lt;br /&gt;
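&lt;br /&gt;
The derived columns in the table above can be reproduced with a short calculation. The following Python sketch assumes the Intel 4004 baseline (2,300 transistors in 1971) and a doubling every two years; the names predicted and table_row are hypothetical helpers introduced only for illustration, and the rounding is an assumption chosen to match the figures shown in the table.&lt;br /&gt;

```python
BASE_YEAR = 1971
BASE_COUNT = 2300  # Intel 4004 transistor count, the starting point of the table


def predicted(year):
    """Predicted transistor count for a year, doubling every two years."""
    return BASE_COUNT * 2 ** ((year - BASE_YEAR) / 2)


def table_row(actual, year):
    """Return (predicted value, variance, percent error) rounded to whole numbers."""
    expected = round(predicted(year))
    variance = actual - expected
    percent = round(100 * variance / expected)
    return expected, variance, percent


# Three sample rows taken from the table above.
samples = [
    ("Intel 8008", 3500, 1972),
    ("Intel 80486", 1180000, 1989),
    ("Dual-Core Itanium 2", 1700000000, 2006),
]
for name, actual, year in samples:
    expected, variance, percent = table_row(actual, year)
    print(f"{name}: predicted {expected:,}, variance {variance:,}, error {percent}%")
```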
&lt;br /&gt;
== Data Review ==&lt;br /&gt;
&lt;br /&gt;
So it's quite clear that the &amp;quot;law&amp;quot; is not to be taken too literally; it is a general marker for what to expect in upcoming years. The counts actually jump in several years to catch up (by 2011, the variance is only 8% with the 10-Core Xeon chip). The Dual-Core Itanium 2 also lurches forward to nearly 300% above the prediction.&lt;br /&gt;
&lt;br /&gt;
The following chart shows the actual and predicted transistor counts:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:TransistorCountVariance.JPG]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Variance Years ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We see that there are probably exceptions which should be taken out of the data; for example, the Atom processor, which is an ultra-low voltage processor embedded in netbooks. These are special purpose processors that do not reflect the state of the technology at the time (for example, the Intel Core i7 chip in 2008, the same year as the Atom chip, is closer to the mark at a variance of only -14%, versus the Atom's -94%).&lt;br /&gt;
&lt;br /&gt;
So it is important, when evaluating whether Moore's law has failed to hold, to compare chips of the same family, or perhaps to take the best technology for a given year. For example, the best chips in the data above for 1979, 1982, 1985 and 1989 (the Motorola 68000, Intel 80286, Intel 80386 and Intel 80486, respectively) either exceed Moore's predictions or come very close.&lt;br /&gt;
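&lt;br /&gt;
One way to apply the &amp;quot;best technology for a given year&amp;quot; idea is to keep only the chip with the highest transistor count in each year before comparing against the prediction. The Python sketch below does this for a handful of rows copied from the table; the helper names best_per_year and percent_error are hypothetical, and choosing the highest count as the measure of &amp;quot;best&amp;quot; is an assumption made for illustration.&lt;br /&gt;

```python
BASE_YEAR, BASE_COUNT = 1971, 2300  # Intel 4004 baseline

# (processor, transistor count, year), copied from the table above.
ROWS = [
    ("Intel 8088", 29000, 1979),
    ("Motorola 68000", 68000, 1979),
    ("Intel 80186", 55000, 1982),
    ("Intel 80286", 134000, 1982),
    ("Atom", 47000000, 2008),
    ("Core i7 (Quad)", 731000000, 2008),
    ("Six-Core Xeon 7400", 1900000000, 2008),
]


def best_per_year(rows):
    """Map each year to the (name, count) pair with the highest count."""
    best = {}
    for name, count, year in rows:
        candidate = (name, count)
        best[year] = max(best.get(year, candidate), candidate,
                         key=lambda chip: chip[1])
    return best


def percent_error(count, year):
    """Variance as a percentage of the count predicted for that year."""
    expected = BASE_COUNT * 2 ** ((year - BASE_YEAR) / 2)
    return round(100 * (count - expected) / expected)


for year, (name, count) in sorted(best_per_year(ROWS).items()):
    print(year, name, f"{percent_error(count, year)}%")
```

Filtering this way drops low-power outliers such as the Atom from the comparison without having to remove them from the data by hand.&lt;br /&gt;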
&lt;br /&gt;
But there are many intervals where this is not true, including 1971-1978 and 1993-2000, in which transistor counts fall well short of the prediction. One explanation is that other chip factors started to become more important during this time (power consumption, for example). One also sees certain chips and chip families which cause transistor counts to lurch forward (for example, the Dual-Core Itanium 2, which included 1.7 billion transistors, about 4 times Moore's prediction for 2006). This lurching forward may mean that, for subsequent years, the hardware outstrips market needs and it takes some time for operating systems and software to catch up. During this period, there is less demand for more powerful chips, so market forces require the production of less expensive chips with lower transistor counts.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Moore's Law - The Future ==&lt;br /&gt;
&lt;br /&gt;
With a good survey of data and historic context firmly in view, we can now consider whether Moore's Law will hold in future years. Clearly, Moore's predictions concerned transistor counts, not processor speed. But like Knowles, Moore was searching for a simple metric of complexity and performance with respect to cost. His ultimate aim was to predict future performance at costs that were feasible in the marketplace. Hence, one can take Moore's Law to apply to performance in general rather than to the specific metric of transistor counts.&lt;br /&gt;
&lt;br /&gt;
Performance is a function of many factors besides transistor counts, including memory management and overall processor architecture (pipelining, cache levels, etc.). Still, in the context of miniaturization and monolithic architectures, it is possible that Moore underestimated the importance of these factors despite his ultimate interest in the sustained future improvement in performance with respect to cost. We must also consider that, as with any new technology, the semiconductor industry was struggling to convince the engineering community and public of the importance of integrated circuits. The focus was deliberate and simple: adopt integrated circuits as the future.&lt;br /&gt;
&lt;br /&gt;
Those who argue that transistor counts will eventually hit a wall typically do so on the basis of physical limitations. The arguments seem reasonable, given that Moore's prediction concerned the number of transistors on a single die and given the general unsustainability of exponential growth. Even if the physical limitations were surmounted by better materials and semiconductor manufacturing processes, it is likely their benefits would be outweighed by alternative innovations at other levels.&lt;br /&gt;
&lt;br /&gt;
The exponential growth predicted by Moore's law is ultimately not sustainable, even as a rough guideline. Even if it were, and the above physical limitations were overcome, the resulting transistor count or performance improvement would likely also significantly outrun Moore's predictions, since the technology to work with materials at an atomic level would likely lead to completely different architectures that Moore could never have predicted.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
*Brock, David C. (2006). &amp;quot;Understanding Moore's Law: four decades of innovation&amp;quot; ''Chemical Heritage Foundation''.&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&amp;lt;references&amp;gt;&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Psamoue</name></author>
	</entry>
</feed>