Virtual memory

Virtual memory is just another level in the memory hierarchy.

It allows main memory to cache pages (blocks) normally stored on disk.

As with caches, the operations performed by virtual memory are transparent to properly-running user programs.

Similarity to caching.
Block = page

Blocks in caches are equivalent to pages in virtual memory.

Pages are anywhere from 1 KB to 64 KB (though today's page sizes are usually 4+ KB).

Miss = page fault

A miss in a cache is analogous to a page fault.
The only difference is the penalty.

Millions of clock cycles for VM as compared to tens of clock cycles for caches.

Miss rate

The miss rate for VM is very low -- less than 0.001%.
This means that fewer than one in 100,000 accesses causes a VM miss, and it is often far fewer.
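To see why the miss rate has to be this low, here is a back-of-the-envelope sketch of the average cost of one access. The 100-cycle memory access and the 1,000,000-cycle page fault penalty are illustrative assumptions, not measurements.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average cost of one access: hit time plus the expected miss penalty."""
    return hit_cycles + miss_rate * miss_penalty_cycles

# Assumed, illustrative numbers only.
MEMORY_ACCESS_CYCLES = 100
PAGE_FAULT_PENALTY_CYCLES = 1_000_000

for miss_rate in (1e-5, 1e-6, 1e-7):   # 0.001%, 0.0001%, 0.00001%
    cost = average_access_cycles(
        MEMORY_ACCESS_CYCLES, miss_rate, PAGE_FAULT_PENALTY_CYCLES
    )
    print(f"miss rate {miss_rate:.5%}: {cost:.1f} cycles per access")
```

Under these assumptions, a miss rate of 0.001% already adds about 10% to the average access cost, which is why the rate is often driven much lower.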

Size

Caches are 16 KB - 1 MB or more.
The VM "cache" is 16 MB to 1024 MB or more -- a factor of 1000 larger.

Differences include:
Replacement mechanism.

In caches, it is primarily controlled by the hardware.
In VM, replacement is primarily controlled by the OS.

The number of bits in the address determines the size of VM, whereas cache size is independent of the address size.

Two classes of VM: paging systems and segmentation systems.

Basic virtual memory caching questions

Where can a block be placed?

Since miss penalties are very high, OS designers always choose lower miss rates over simple placement algorithms.
Therefore, VM is almost always fully-associative (blocks can be placed anywhere in main memory).

How is a block found?

Paging systems use a page table to translate virtual page numbers into physical page numbers.

The physical address is constructed by concatenating the physical page number (found in the table) to the offset.

Segmented systems use a similar structure except that the segment's physical address is ADDED to the offset.
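The two translation schemes can be sketched as follows. The 4 KB page size, the table contents, and the function names are hypothetical, chosen only for illustration; the segment length check is an added assumption that the text does not mention.

```python
PAGE_SIZE = 4096                            # assumed 4 KB pages
OFFSET_BITS = PAGE_SIZE.bit_length() - 1    # 12 bits of page offset

# Hypothetical page table: virtual page number -> physical page number.
page_table = {0: 5, 1: 2, 2: 7}

def translate_paged(virtual_address):
    """Split the address, look up the page, then concatenate frame and offset."""
    vpn = virtual_address >> OFFSET_BITS
    offset = virtual_address & (PAGE_SIZE - 1)
    if vpn not in page_table:
        raise LookupError(f"page fault on virtual page {vpn}")
    return (page_table[vpn] << OFFSET_BITS) | offset

# Hypothetical segment table: segment number -> (base physical address, length).
segment_table = {0: (0x10000, 0x3000), 1: (0x48000, 0x0800)}

def translate_segmented(segment, offset):
    """Add the offset to the segment's base address (length check is assumed)."""
    base, length = segment_table[segment]
    if offset >= length:
        raise ValueError("offset outside segment")
    return base + offset

print(hex(translate_paged(0x1234)))         # virtual page 1, offset 0x234
print(hex(translate_segmented(1, 0x10)))
```

Concatenation works for pages because a page always starts on a page-size boundary; a segment can start at any address, so an addition is needed.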

Note that the page table needs enough entries to map the entire virtual address space since it is accessed using virtual page numbers.

This can result in a large amount of space dedicated just to the page table.
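A quick calculation shows how large a flat page table can get. The 32-bit and 48-bit address widths, the 4 KB pages, and the entry sizes are assumptions for illustration.

```python
def flat_page_table_bytes(virtual_address_bits, page_size_bytes, entry_bytes):
    """Size of a single-level table with one entry per virtual page."""
    entries = 2 ** virtual_address_bits // page_size_bytes
    return entries * entry_bytes

MIB = 2 ** 20

# Assumed configuration: 32-bit virtual addresses, 4 KB pages, 4-byte entries.
print(flat_page_table_bytes(32, 4096, 4) // MIB, "MiB per process")

# Assumed configuration: 48-bit virtual addresses, 4 KB pages, 8-byte entries.
print(flat_page_table_bytes(48, 4096, 8) // MIB, "MiB per process")
```

The table size depends on the virtual address space, not on how much physical memory is installed.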

One optimization is to use hashing to restrict the number of page table entries to the number of physical pages.

This is called an inverted page table.
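A minimal sketch of the idea, assuming linear probing to resolve hash collisions (real designs often use chaining instead) and ignoring page removal. The class and method names are hypothetical.

```python
class InvertedPageTable:
    """One entry per physical frame, located by hashing (process id, virtual page)."""

    def __init__(self, num_frames):
        self.frames = [None] * num_frames   # frame index -> (pid, vpn) or None

    def _probe(self, pid, vpn):
        """Yield frame indices in linear-probe order, starting at the hash slot."""
        start = hash((pid, vpn)) % len(self.frames)
        for step in range(len(self.frames)):
            yield (start + step) % len(self.frames)

    def insert(self, pid, vpn):
        """Place the page in the first free frame on its probe sequence."""
        for frame in self._probe(pid, vpn):
            if self.frames[frame] is None or self.frames[frame] == (pid, vpn):
                self.frames[frame] = (pid, vpn)
                return frame
        raise MemoryError("no free frame; a page must be replaced first")

    def lookup(self, pid, vpn):
        """Return the frame holding the page, or None to signal a page fault."""
        for frame in self._probe(pid, vpn):
            if self.frames[frame] == (pid, vpn):
                return frame
            if self.frames[frame] is None:
                return None
        return None

ipt = InvertedPageTable(num_frames=8)
frame = ipt.insert(pid=1, vpn=0x40)
print(ipt.lookup(1, 0x40) == frame, ipt.lookup(2, 0x40))
```

The table has exactly as many entries as there are physical frames, regardless of how large the virtual address space is.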

Translation lookaside buffers (TLBs) are used to cache these translations, and reduce address translation time.

Which block is replaced?

Most operating systems use LRU or an approximation to it.

The page table often includes a reference bit to help do LRU replacement.
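One widely used approximation to LRU that relies on a reference bit is the clock (second-chance) algorithm. The text does not name a specific algorithm, so treat this as one possible sketch; the class name is hypothetical.

```python
class ClockReplacer:
    """Second-chance (clock) replacement over a fixed set of frames."""

    def __init__(self, num_frames):
        self.pages = [None] * num_frames         # page held by each frame
        self.referenced = [False] * num_frames   # per-frame reference bit
        self.hand = 0

    def access(self, page):
        """Touch a page; return the page evicted to make room, if any."""
        if page in self.pages:                   # hit: set the reference bit
            self.referenced[self.pages.index(page)] = True
            return None
        # Miss: advance the hand, clearing reference bits, until a frame
        # with a clear bit (or an empty frame) is found.
        while self.pages[self.hand] is not None and self.referenced[self.hand]:
            self.referenced[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.pages[self.hand]
        self.pages[self.hand] = page
        self.referenced[self.hand] = True
        self.hand = (self.hand + 1) % len(self.pages)
        return victim

clock = ClockReplacer(num_frames=3)
for page in ["A", "B", "C", "D", "B", "E"]:
    print(page, "evicts", clock.access(page))
```

In this run, the second touch of B sets its reference bit, so the hand skips B and evicts C when E arrives.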

What happens on a write?

VM is always writeback (capture as many writes as possible before writing the page to disk).

Write-through does not make sense because of the very large access penalty.

Thus, the page table uses a dirty bit to keep track of which pages have been modified and must be written to disk before they are replaced.

We do not want to write pages to disk that have not been modified.
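The role of the dirty bit can be sketched as follows. The `PageTableEntry` fields and the `disk_writes` list (a stand-in for real disk I/O) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PageTableEntry:
    frame: int
    dirty: bool = False

disk_writes = []   # records which pages were written back (stand-in for disk I/O)

def write_to_page(entry):
    """A store to the page only sets the dirty bit; nothing goes to disk yet."""
    entry.dirty = True

def evict(page_number, entry):
    """Write the page back only if it was modified while resident."""
    if entry.dirty:
        disk_writes.append(page_number)
        entry.dirty = False

modified = PageTableEntry(frame=3)
untouched = PageTableEntry(frame=4)
for _ in range(1000):          # many writes are captured by one eventual write-back
    write_to_page(modified)
evict(10, modified)
evict(11, untouched)
print(disk_writes)
```

A thousand stores to page 10 cost a single disk write, and the clean page 11 costs none.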

Page tables imply that a memory reference requires two memory accesses.

One for the page table and one to get the data.

A TLB, which caches previous translations, can be effective in reducing memory references to the page table.

This works because of the principle of locality.
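A small simulation illustrates the effect of locality. The 4-entry, fully-associative, LRU-replaced TLB and the access pattern are assumptions for illustration only.

```python
from collections import OrderedDict

class TLB:
    """Tiny fully-associative TLB with LRU replacement in front of a page table."""

    def __init__(self, page_table, capacity):
        self.page_table = page_table
        self.capacity = capacity
        self.entries = OrderedDict()    # vpn -> ppn, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, vpn):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)
            return self.entries[vpn]
        self.misses += 1                # costs an extra access to the page table
        ppn = self.page_table[vpn]
        self.entries[vpn] = ppn
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return ppn

page_table = {vpn: vpn + 100 for vpn in range(64)}   # hypothetical mapping
tlb = TLB(page_table, capacity=4)

# 1024 references to each of three pages, one page at a time.
for vpn in (0, 1, 2):
    for _ in range(1024):
        tlb.translate(vpn)

print(tlb.hits, "hits,", tlb.misses, "misses")
```

Only the first reference to each page goes to the page table; every other reference is translated by the TLB.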

Translation Look-aside Buffer

Similar to a cache:

Tag holds the virtual page number portion of the virtual address.
Data portion holds the physical page frame number, protection field, valid bit, use bit and a dirty bit.

As with normal caches, the TLB may be fully-associative, direct-mapped, or set-associative.

Replacement may be done in hardware or may be assisted by software.

For example, a miss in the TLB causes an exception which is handled by the OS, which places the appropriate page information into the TLB.

Hardware handling is faster, but software is more flexible.

Small, fast TLBs are crucial because they are on the critical path to accessing data from the cache.

This is particularly true if the cache is physically addressed.

Selecting a Page Size

Large page sizes are generally better because:

They reduce the size of the page table.
They are more efficient to transfer between memory and disk.
They allow a TLB to cache translations for more of memory.

The biggest drawback to large pages is that they may waste memory through internal fragmentation.

Assuming a process has three primary segments (text, heap and stack), the average wasted storage per process will be 1.5 times the page size, since the last page of each segment is, on average, half empty.

When page size is 4 KB or 8 KB, this is negligible for machines with megabytes of memory.

For larger pages, e.g., 64 KB, lots of storage may be wasted.
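The trade-off can be put into numbers. The three segments come from the text above; the 64-entry TLB is an assumed size used only to show how much memory the TLB can map at each page size.

```python
SEGMENTS_PER_PROCESS = 3     # text, heap and stack
TLB_ENTRIES = 64             # assumed TLB size, for illustration only

def expected_waste_bytes(page_size):
    """Each segment's last page is, on average, half empty."""
    return SEGMENTS_PER_PROCESS * page_size // 2

def tlb_reach_bytes(page_size):
    """Amount of memory whose translations fit in the TLB at once."""
    return TLB_ENTRIES * page_size

for page_kb in (4, 8, 64):
    page_size = page_kb * 1024
    print(f"{page_kb:>2} KB pages: "
          f"about {expected_waste_bytes(page_size) // 1024} KB wasted per process, "
          f"TLB covers {tlb_reach_bytes(page_size) // 1024} KB")
```

Going from 4 KB to 64 KB pages multiplies both the expected waste per process and the memory covered by the TLB by 16.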

Uses of Virtual Memory

Protection

VM is often used to protect one program from others in the system.
Protection mechanisms must have hardware support.

Base & bounds

Each reference must fall between two addresses, given by the base & bound registers.
This method also allows some relocation.
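A base and bounds check might look like the sketch below. It assumes that program addresses are relocated by adding the base register and that the bound register holds the first physical address the process may not touch; other conventions exist. The names are hypothetical.

```python
class ProtectionFault(Exception):
    """Raised when a reference falls outside the process's region."""

def check_and_relocate(address, base, bound):
    """Relocate a program address by `base` and reject it if outside [base, bound)."""
    physical = base + address
    if not (base <= physical < bound):
        raise ProtectionFault(f"address {address:#x} is outside the region")
    return physical

BASE, BOUND = 0x4000, 0x5000    # hypothetical register contents for one process

print(hex(check_and_relocate(0x10, BASE, BOUND)))
try:
    check_and_relocate(0x2000, BASE, BOUND)
except ProtectionFault as fault:
    print("fault:", fault)
```

Because every address is offset by the base register, the OS can load the same program at a different physical location simply by changing the register values.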

User processes cannot be allowed to change these registers, but the OS must be able to do so on a process switch.

Therefore, the hardware must be able to:

Provide at least two modes of operation, user and kernel, and a mechanism to switch between them.
Provide a protection mechanism for other portions of the CPU state to prevent user processes from being malicious.

User/supervisor mode bit(s).
Interrupt enable/disable bit(s).

Base and bound registers constitute the minimum protection system.

Virtual Memory offers a more fine-grained alternative.

Processes have their own page tables, which they cannot modify themselves.

Permission flags are provided with each segment or page.
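Per-page permission checking can be sketched as below. The particular flags (read, write, execute) and the table layout are assumptions; real architectures differ in which permissions they encode.

```python
from enum import Flag, auto

class Perm(Flag):
    READ = auto()
    WRITE = auto()
    EXECUTE = auto()

# Hypothetical per-process page table: vpn -> (ppn, permissions).
page_table = {
    0: (8, Perm.READ | Perm.EXECUTE),   # code page
    1: (3, Perm.READ | Perm.WRITE),     # data page
}

def access(vpn, wanted):
    """Translate a virtual page, refusing accesses the flags do not allow."""
    if vpn not in page_table:
        raise LookupError(f"page fault on virtual page {vpn}")
    ppn, perms = page_table[vpn]
    if wanted not in perms:
        raise PermissionError(f"{wanted.name} not permitted on virtual page {vpn}")
    return ppn

print(access(1, Perm.WRITE))
try:
    access(0, Perm.WRITE)               # writing to a code page
except PermissionError as err:
    print("denied:", err)
```

Since a process cannot modify its own page table, it cannot grant itself permissions or map pages that belong to another process.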

Concentric rings of security and capability lists are more fine-grained alternatives, allowing more than two levels of protection.

The OS course discusses these in detail.

Effects of CPU design on memory hierarchy

Superscalar & vector execution

A superscalar or vector machine may fetch several words per cycle.

Clearly, the memory system must deliver the bandwidth to handle this, otherwise the benefit is lost.

The brunt of the load falls upon the L1 cache.

Bandwidth can be increased by widening the path to the cache or by providing extra ports to the cache.

However, cache access is often the bottleneck in modern CPUs.

Speculative execution

Speculative execution and conditional instructions may generate invalid addresses that would not occur otherwise.

The memory system must recognize and suppress these exceptions.

Similarly, it must not stall the cache on a miss caused by a speculative instruction.

I/O and cache consistency

I/O devices move data between peripherals and memory.

This has two pitfalls:

Data written into memory is not automatically updated in the cache.
Data in a writeback cache is not written to memory immediately so memory has stale data.

One solution is to flush blocks from the cache that are used in the I/O operation.
This is done:

Before the I/O for a write (so the write operation uses up-to-date information).
After the I/O for a read. Flushing before the I/O works as well, provided the CPU does not access the data while it is being read into memory.
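Both cases can be demonstrated with a toy model. The `WriteBackCache` class is hypothetical, tracks single words rather than blocks, and uses a dictionary as a stand-in for main memory.

```python
class WriteBackCache:
    """Toy write-back cache over a dict standing in for main memory."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                 # address -> (value, dirty)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = (self.memory[addr], False)
        return self.lines[addr][0]

    def write(self, addr, value):
        self.lines[addr] = (value, True)    # memory is not updated yet

    def flush(self, addr):
        """Write the line back if dirty, then drop it from the cache."""
        if addr in self.lines:
            value, dirty = self.lines.pop(addr)
            if dirty:
                self.memory[addr] = value

memory = {0x100: 1}
cache = WriteBackCache(memory)

# I/O write (memory -> device): flush first so the device sees the latest value.
cache.write(0x100, 42)
cache.flush(0x100)
sent_to_device = memory[0x100]

# I/O read (device -> memory): flush so the CPU does not keep a stale copy.
cache.read(0x100)           # the CPU now has the old value cached
memory[0x100] = 7           # the device deposits new data directly in memory
cache.flush(0x100)
seen_by_cpu = cache.read(0x100)

print(sent_to_device, seen_by_cpu)
```

Without the first flush the device would have been sent the stale value 1; without the second, the CPU would have kept reading 42 from its cache.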

An alternate method is simply to mark the blocks from I/O buffers as uncacheable.

Other solutions include:

Watch the I/O buses for addresses that match tags in the cache.

This eliminates the consistency problem.
The drawback is that the checking slows down the cache.

Do I/O directly into the cache.

This method guarantees consistency but it slows down the cache since both the CPU and I/O access it.

Moreover, it displaces data in the cache with new data that is unlikely to be accessed soon by the CPU.

Fallacies and Pitfalls

Don't predict cache performance of program A from program B.

Programs vary widely in how they use cache.

A scientific program may have a small tight code loop but access large quantities of data.

On the other hand, a word processing program might operate on relatively little data but use lots of code.

Simulate plenty of memory references.

A CPU executes 100 million or more instructions per second.

Simulating cache behavior using traces of only a few million references can be misleading, particularly since a program's locality behavior is not constant over its entire run.
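The effect of a trace that is too short can be shown with a small direct-mapped cache simulator. The cache geometry and the two-phase synthetic trace are assumptions made up for this illustration, not data from a real program.

```python
def miss_rate(trace, num_lines=64, line_bytes=16):
    """Miss rate of a direct-mapped cache on a list of byte addresses."""
    resident = [None] * num_lines    # block number currently held by each line
    misses = 0
    for addr in trace:
        block = addr // line_bytes
        index = block % num_lines
        if resident[index] != block:
            resident[index] = block
            misses += 1
    return misses / len(trace)

# Hypothetical two-phase program: a tight loop over 512 bytes, followed by
# a single pass over 64 KB of data with almost no reuse.
phase1 = [(i * 4) % 512 for i in range(20_000)]
phase2 = [4096 + i * 4 for i in range(16_384)]
trace = phase1 + phase2

short = miss_rate(trace[:10_000])
full = miss_rate(trace)
print(f"first 10,000 references: {short:.2%} miss rate")
print(f"whole trace:             {full:.2%} miss rate")
```

The short prefix only ever sees the first phase, so it reports a miss rate well under 1%, while the full trace has a miss rate above 10%.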

Don't ignore the OS.

The OS itself can miss in the cache, and it can interfere with application programs, causing additional misses.