Speeding Up Paging

Having just studied the basics of virtual memory and paging, we can now go into more detail about possible implementations. In any paging system, two main issues must be faced:

1. The mapping from virtual address to physical address must be fast.
2. If the virtual address space is large, the page table will be large.

The first point is a result of the fact that the virtual-to-physical mapping must be done on every memory reference. All instructions must eventually come from memory and many of them reference operands in memory as well. Therefore, it is necessary to make one, two, or sometimes more page table references per instruction. If an instruction execution takes, say, 1 nsec, the page table lookup must be done in under 0.2 nsec to avoid having the mapping become a major bottleneck.

The second point follows from the fact that all modern computers use virtual addresses of at least 32 bits, with 64 bits becoming gradually more common. With, say, a 4-KB page size, a 32-bit address space has 1 million pages, and a 64-bit address space has more than you want to consider. With 1 million pages in the virtual address space, the page table must have 1 million entries, and remember that each process needs its own page table (because it has its own virtual address space).
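To make the arithmetic concrete: with 4-KB pages, a 32-bit address splits into a 20-bit virtual page number and a 12-bit offset, so the page table needs 2^20 = 1,048,576 entries. A minimal sketch in C (the sample address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u                        /* 4-KB page = 2^12 bytes   */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  (1u << (32 - PAGE_SHIFT))  /* 2^20 = 1,048,576 pages   */

int main(void)
{
    uint32_t va     = 0x00403ABCu;            /* arbitrary virtual address */
    uint32_t vpn    = va >> PAGE_SHIFT;       /* high 20 bits: page number */
    uint32_t offset = va & (PAGE_SIZE - 1);   /* low 12 bits: byte in page */

    printf("pages in a 32-bit space: %u\n", NUM_PAGES);
    printf("vpn = %u, offset = 0x%x\n", vpn, offset);
    return 0;
}
```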

The need for large, fast page mapping is a major restriction on the way computers are built. The simplest design (at least conceptually) is to have a single page table consisting of an array of fast hardware registers, with one entry for each virtual page, indexed by virtual page number, as shown in "Paging" Figure 3. When a process is started up, the operating system loads the registers with the process page table, taken from a copy kept in main memory. During process execution, no more memory references are required for the page table. The advantages of this method are that it is straightforward and requires no memory references during mapping. A disadvantage is that it is unbearably expensive if the page table is large. Another is that having to load the full page table at every context switch hurts performance.

At the other extreme, the page table can be entirely in main memory. All the hardware needs then is a single register that points to the start of the page table. This design allows the virtual-to-physical map to be changed at a context switch by reloading that one register. Of course, it has the disadvantage of requiring one or more memory references to read page table entries during the execution of each instruction, making it very slow.
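A sketch of this design in C may help: the single hardware register becomes a base pointer, and every translation pays an extra memory reference to index the table. The pte_t layout and handle_page_fault are illustrative assumptions, not any particular architecture's format:

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

typedef struct {            /* hypothetical page table entry */
    uint32_t frame;         /* physical page frame number    */
    int      present;       /* 1 if the page is in memory    */
} pte_t;

extern uint32_t handle_page_fault(uint32_t va);  /* fault path, not shown */

/* 'base' plays the role of the one hardware register that points
 * to the start of the in-memory page table. */
uint32_t translate(const pte_t *base, uint32_t va)
{
    pte_t pte = base[va >> PAGE_SHIFT];   /* the extra memory reference */
    if (!pte.present)
        return handle_page_fault(va);
    return (pte.frame << PAGE_SHIFT) | (va & PAGE_MASK);
}
```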

Translation Lookaside Buffers

We shall now consider widely implemented schemes for speeding up paging and for handling large virtual address spaces, starting with the former. The starting point of most optimization techniques is that the page table is in memory. Potentially, this design has an enormous impact on performance. Consider, for instance, a 1-byte instruction that copies one register to another. In the absence of paging, this instruction makes only one memory reference, to fetch the instruction. With paging, at least one additional memory reference will be required, to access the page table. Since execution speed is normally limited by the rate at which the CPU can get instructions and data out of memory, having to make two memory references per memory reference cuts performance in half. Under these conditions, no one would use paging.

Computer designers have known about this problem for years and have come up with a solution. Their solution is based on the observation that most programs tend to make a large number of references to a small number of pages, and not the other way around. Therefore only a small fraction of the page table entries are heavily read; the rest are barely used at all.

The solution that has been devised is to equip computers with a small hardware device for mapping virtual addresses to physical addresses without going through the page table. The device, called a TLB (Translation Lookaside Buffer) or sometimes an associative memory, is shown in Figure 1. It is generally inside the MMU and comprises a small number of entries, eight in this example, but rarely more than 64. Each entry includes information about one page, containing the virtual page number, a bit that is set when the page is modified, the protection code (read/write/execute permissions), and the physical page frame in which the page is located. These fields have a one-to-one correspondence with the fields in the page table, except for the virtual page number, which is not needed in the page table (it is the index into the table). Another bit indicates whether the entry is valid (i.e., in use) or not.

Figure 1. A TLB to speed up paging.

An example that might generate the TLB of Figure 1 is a process in a loop that spans virtual pages 19, 20, and 21, so these TLB entries have protection codes for reading and executing. The main data currently being used (say, an array being processed) are on pages 129 and 130. Page 140 contains the indices used in the array calculations. Finally, the stack is on pages 860 and 861.
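One might model a TLB entry in C as follows; the field names and sizes are illustrative, not taken from any real MMU:

```c
#include <stdint.h>

#define TLB_SIZE 8            /* as in the figure; rarely more than 64 */

typedef struct {
    int      valid;           /* entry in use?                          */
    uint32_t vpn;             /* virtual page number: the lookup key    */
    int      modified;        /* set when the page has been written     */
    uint8_t  protection;      /* read/write/execute permission bits     */
    uint32_t frame;           /* physical page frame holding the page   */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_SIZE];
```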

Let us now examine how the TLB functions. When a virtual address is presented to the MMU for translation, the hardware first checks to see if its virtual page number is present in the TLB by comparing it to all the entries simultaneously (i.e., in parallel). If a valid match is found and the access does not violate the protection bits, the page frame is taken directly from the TLB, without going to the page table. If the virtual page number is present in the TLB but the instruction is trying to write on a read-only page, a protection fault is generated.
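In software the parallel compare can only be modeled as a loop over the entries; real hardware checks all of them at once. A sketch reusing the entry type above (protection_fault and the PROT_WRITE mask are assumed names):

```c
#include <stdint.h>

#define TLB_SIZE   8
#define PROT_WRITE 0x2                       /* assumed permission bit */

typedef struct {                             /* same entry type as above */
    int valid; uint32_t vpn; int modified;
    uint8_t protection; uint32_t frame;
} tlb_entry_t;

extern tlb_entry_t tlb[TLB_SIZE];
extern void protection_fault(uint32_t vpn);  /* hypothetical trap */

/* Returns 1 on a hit (frame in *frame_out), 0 on a miss. */
int tlb_lookup(uint32_t vpn, int is_write, uint32_t *frame_out)
{
    for (int i = 0; i < TLB_SIZE; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            if (is_write && !(tlb[i].protection & PROT_WRITE)) {
                protection_fault(vpn);       /* write to a read-only page */
                return 0;
            }
            *frame_out = tlb[i].frame;       /* frame taken straight from TLB */
            return 1;
        }
    }
    return 0;                                /* miss: fall back to page table */
}
```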

The interesting case is what happens when the virtual page number is not in the TLB. The MMU detects the miss and does an ordinary page table lookup. It then evicts one of the entries from the TLB and replaces it with the page table entry just looked up. In this way, if that page is used again soon, the second time around it will result in a TLB hit rather than a miss. When an entry is evicted from the TLB, the modified bit is copied back into the page table entry in memory. The other values are already there, except the reference bit. When the TLB is loaded from the page table, all the fields are taken from memory.
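The miss path might be sketched like this; walk_page_table, pick_victim, and set_pte_modified are hypothetical helpers standing in for the page table walk, the replacement policy, and the write-back of the modified bit:

```c
#include <stdint.h>

typedef struct { int valid; uint32_t vpn; int modified;
                 uint8_t protection; uint32_t frame; } tlb_entry_t;
typedef struct { uint32_t frame; int present; int modified;
                 uint8_t protection; } pte_t;

extern tlb_entry_t tlb[];
extern pte_t *walk_page_table(uint32_t vpn);   /* ordinary table lookup */
extern int    pick_victim(void);               /* replacement policy    */
extern void   set_pte_modified(uint32_t vpn);  /* write-back to the PTE */

void tlb_refill(uint32_t vpn)
{
    pte_t *pte   = walk_page_table(vpn);
    int   victim = pick_victim();

    /* On eviction, only the modified bit must be copied back;
     * everything else is already in the page table entry. */
    if (tlb[victim].valid && tlb[victim].modified)
        set_pte_modified(tlb[victim].vpn);

    tlb[victim].valid      = 1;                /* load all fields from memory */
    tlb[victim].vpn        = vpn;
    tlb[victim].modified   = pte->modified;
    tlb[victim].protection = pte->protection;
    tlb[victim].frame      = pte->frame;
}
```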

Software TLB Management

So far, we have assumed that every machine with paged virtual memory has page tables recognized by the hardware, plus a TLB. In this design, TLB management and handling TLB faults are done completely by the MMU hardware. Traps to the operating system take place only when a page is not in memory.

In the past, this assumption was true. Today, however, many modern RISC machines, including the SPARC, MIPS, and HP PA, do nearly all of this page management in software. On these machines, the TLB entries are explicitly loaded by the operating system. When a TLB miss occurs, instead of the MMU just going to the page tables to find and fetch the needed page table entry, it generates a TLB fault and tosses the problem into the lap of the operating system. The system must find the page, remove an entry from the TLB, enter the new one, and restart the instruction that faulted. And, of course, all of this must be done in a handful of instructions because TLB misses occur much more frequently than page faults.
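The control flow of such a handler might look roughly as follows. This is a sketch, not a real kernel's code: tlb_write_entry stands in for the privileged TLB-write instruction these machines provide, and the other names are assumptions:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12

typedef struct { uint32_t frame; int present; int modified;
                 uint8_t protection; } pte_t;

extern pte_t *walk_page_table(uint32_t vpn);
extern int    pick_victim(void);
extern void   page_fault(uint32_t va);    /* slow path: page not in memory */
extern void   tlb_write_entry(int slot, uint32_t vpn, const pte_t *pte);

/* Invoked by the trap; returning from it restarts the
 * instruction that faulted. */
void tlb_fault_handler(uint32_t faulting_va)
{
    uint32_t vpn = faulting_va >> PAGE_SHIFT;
    pte_t   *pte = walk_page_table(vpn);

    if (pte == NULL || !pte->present) {
        page_fault(faulting_va);  /* needs the full page fault machinery */
        return;
    }
    tlb_write_entry(pick_victim(), vpn, pte);  /* just reload the TLB */
}
```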

Amazingly enough, if the TLB is reasonably large (say, 64 entries) to reduce the miss rate, software management of the TLB turns out to be acceptably efficient. The main gain here is a much simpler MMU, which frees up a substantial amount of area on the CPU chip for caches and other features that can improve performance. Software TLB management is discussed by Uhlig et al. (1994).

Different strategies have been developed to improve performance on machines that do TLB management in software. One approach attacks both reducing TLB misses and reducing the cost of a TLB miss when it does take place (Bala et al., 1994). To reduce TLB misses, sometimes the operating system can use its intuition to figure out which pages are likely to be used next and to preload entries for them in the TLB. For instance, when a client process sends a message to a server process on the same machine, it is very likely that the server will have to run soon. Knowing this, while processing the trap to do the send, the system can also check to see where the server's code, data, and stack pages are and map them in before they get a chance to cause TLB faults.
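As a sketch of the preloading idea, a hook in the IPC send path could warm the TLB for the server before it runs. Everything here, from the opaque process structure to server_hot_pages, is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t frame; int present; int modified;
                 uint8_t protection; } pte_t;
struct process;                               /* opaque; layout assumed */

extern void   server_hot_pages(struct process *p, uint32_t vpns[], size_t *n);
extern pte_t *walk_page_table_of(struct process *p, uint32_t vpn);
extern void   tlb_write_entry(int slot, uint32_t vpn, const pte_t *pte);
extern int    pick_victim(void);

/* Called while processing the send trap: map in the server's code,
 * data, and stack pages before they can cause TLB faults. */
void preload_server_mappings(struct process *server)
{
    uint32_t vpns[8];
    size_t   n = 0;

    server_hot_pages(server, vpns, &n);       /* e.g., code, data, stack */
    for (size_t i = 0; i < n && i < 8; i++) {
        pte_t *pte = walk_page_table_of(server, vpns[i]);
        if (pte != NULL && pte->present)
            tlb_write_entry(pick_victim(), vpns[i], pte);
    }
}
```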

The normal way to process a TLB miss, whether in hardware or in software, is to go to the page table and perform the indexing operations to locate the page referenced. The problem with doing this search in software is that the pages holding the page table may not be in the TLB, which will cause additional TLB faults during the processing. These faults can be reduced by maintaining a large (e.g., 4-KB) software cache of TLB entries in a fixed location whose page is always kept in the TLB. By first checking the software cache, the operating system can considerably reduce TLB misses.
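That software cache might be pictured as a small direct-mapped table of recent translations, kept on a page that is itself pinned in the TLB so that probing it cannot fault. Names and sizes below are assumptions:

```c
#include <stdint.h>

#define SW_SLOTS 256    /* slot count is illustrative; the text suggests ~4 KB */

typedef struct { int valid; uint32_t vpn; int modified;
                 uint8_t protection; uint32_t frame; } tlb_entry_t;

/* Lives at a fixed location; the page holding it is always kept
 * in the TLB, so probing it cannot itself cause a TLB fault. */
static tlb_entry_t sw_cache[SW_SLOTS];

extern tlb_entry_t *walk_and_fill(tlb_entry_t *slot, uint32_t vpn);

tlb_entry_t *lookup_translation(uint32_t vpn)
{
    tlb_entry_t *slot = &sw_cache[vpn % SW_SLOTS];
    if (slot->valid && slot->vpn == vpn)
        return slot;                     /* hit: no page table walk needed  */
    return walk_and_fill(slot, vpn);     /* miss: walk the table, then cache */
}
```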

When software TLB management is used, it is necessary to understand the difference between two kinds of misses. A soft miss occurs when the page referenced is not in the TLB, but is in memory. All that is needed here is for the TLB to be updated; no disk I/O is required. Typically a soft miss takes 10-20 machine instructions to handle and can be completed in a few nanoseconds. In contrast, a hard miss occurs when the page itself is not in memory (and of course, also not in the TLB). A disk access is needed to bring in the page, which takes several milliseconds. A hard miss is easily a million times slower than a soft miss.

